library(tidyverse)
library(tidymodels)
library(readr)
setwd("/Users/omeryilmaz/Desktop/ECON465_DataScience")
credit_data <- read_csv("data/credit_risk_dataset.csv")
credit_card <- read_csv("data/UCI_Credit_Card.csv")ECON 465 – Stage 2: Predictive Modeling
Introduction
This Stage 2 report continues the project from Stage 1. Two datasets are used.
Dataset 1 is the Credit Risk Dataset. The target variable is loan amount in USD, which is continuous. The economic question is: What factors predict the loan amount a borrower requests?
Dataset 2 is the UCI Credit Card Dataset. The target variable is the default payment status for the next month, which is binary (0 = no default, 1 = default). The economic question is: Can we predict whether a credit card customer will default on their next payment?
Correction from Stage 1: In Stage 1, the outcome variable for the classification dataset was incorrectly identified. The credit limit variable was analyzed instead of the actual target variable, default payment status. The correct target variable is binary (0/1) and follows a Bernoulli distribution. It takes only two values and its key parameter is p, the probability of default. This means logistic regression is the correct model, and accuracy, precision, and recall are the correct metrics, not RMSE.
Dataset 1: Regression – Credit Risk Dataset
Data Import and Cleaning
setwd("/Users/omeryilmaz/Desktop/ECON465_DataScience")
credit_data <- read_csv("data/credit_risk_dataset.csv")
clean_credit <- credit_data |>
dplyr::select(
person_age,
person_income,
person_emp_length,
loan_intent,
loan_int_rate,
loan_amnt
) |>
na.omit()
glimpse(clean_credit)Rows: 28,638
Columns: 6
$ person_age <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22, 21, 23, …
$ person_income <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78956,…
$ person_emp_length <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4, 2, 7, 0,…
$ loan_intent <chr> "PERSONAL", "EDUCATION", "MEDICAL", "MEDICAL", "MEDI…
$ loan_int_rate <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, 11.1…
$ loan_amnt <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35000,…
The dataset contains information on borrowers including their income, age, employment length, loan intent, and interest rate. The target variable is loan_amnt the loan amount requested. After removing missing values, the dataset is ready for modeling.
Exploratory Data Analysis
Before building models, we explore the relationships between predictors and the target variable.
clean_credit |>
ggplot(aes(x = person_income, y = loan_amnt)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Income vs Loan Amount",
x = "Annual Income (USD)",
y = "Loan Amount (USD)"
) +
theme_minimal()clean_credit |>
ggplot(aes(x = loan_intent, y = loan_amnt, fill = loan_intent)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = "Loan Amount by Loan Intent",
x = "Loan Intent",
y = "Loan Amount (USD)"
) +
theme_minimal()Higher income borrowers tend to request larger loans. Loan intent also matters education and home improvement loans tend to be larger than personal or medical loans. These patterns justify including these variables as predictors.
2.1 Data Splitting
set.seed(465)
credit_split <- initial_split(clean_credit, prop = 0.80)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)
cat("Training set size:", nrow(credit_train), "\n")Training set size: 22910
cat("Test set size:", nrow(credit_test), "\n")Test set size: 5728
The dataset is split into 80% training and 20% test sets using initial_split() with set.seed(465) for reproducibility. The training set is used to build the models; the test set evaluates performance on unseen data.
2.2 Predictive Models
lm_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")Model 1: Simple Linear Regression
Model 1 uses only person_income and loan_int_rate. These are the two most direct financial determinants of loan size: income captures the borrower’s ability to repay, and interest rate reflects the cost of borrowing.
model1_fit <- lm_spec |>
fit(loan_amnt ~ person_income + loan_int_rate, data = credit_train)
pred1 <- predict(model1_fit, new_data = credit_test) |>
bind_cols(credit_test |> dplyr::select(loan_amnt))
metrics1 <- pred1 |>
metrics(truth = loan_amnt, estimate = .pred)
metrics1# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 6604.
2 rsq standard 0.0409
3 mae standard 4629.
Model 2: Extended Linear Regression
Model 2 adds person_age, person_emp_length, and loan_intent. Age and employment length capture borrower stability. Loan intent reflects the purpose of borrowing education, medical, home improvement, or personal which is a meaningful economic determinant of the amount requested.
model2_fit <- lm_spec |>
fit(loan_amnt ~ person_income + loan_int_rate + person_age +
person_emp_length + loan_intent, data = credit_train)
pred2 <- predict(model2_fit, new_data = credit_test) |>
bind_cols(credit_test |> dplyr::select(loan_amnt))
metrics2 <- pred2 |>
metrics(truth = loan_amnt, estimate = .pred)
metrics2# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 6545.
2 rsq standard 0.0461
3 mae standard 4620.
2.3 Model Comparison
comparison_reg <- bind_rows(
metrics1 |> mutate(model = "Model 1: Simple"),
metrics2 |> mutate(model = "Model 2: Extended")
) |>
filter(.metric %in% c("rmse", "rsq")) |>
dplyr::select(model, .metric, .estimate)
knitr::kable(comparison_reg, caption = "Regression Model Comparison (Test Set)")| model | .metric | .estimate |
|---|---|---|
| Model 1: Simple | rmse | 6603.7326848 |
| Model 1: Simple | rsq | 0.0409248 |
| Model 2: Extended | rmse | 6545.3776568 |
| Model 2: Extended | rsq | 0.0461347 |
Interpretation: RMSE measures average prediction error in USD lower is better. R² measures what share of variation in loan amounts is explained higher is better.
Model 2 is the better model. Adding loan intent, age, and employment length reduces prediction error because these variables capture economically meaningful differences between borrowers. A lending platform relying only on income and interest rate would produce poorly calibrated loan offers for example, it cannot distinguish between a borrower requesting a small medical loan and one requesting a large home improvement loan at the same income level. The added complexity of Model 2 is justified by the improvement in predictive accuracy.
2.4 Cross-Validation
5 fold cross-validation is performed on Model 2, the better model.
set.seed(465)
folds_credit <- vfold_cv(credit_train, v = 5)
cv_results_reg <- fit_resamples(
lm_spec,
loan_amnt ~ person_income + loan_int_rate + person_age +
person_emp_length + loan_intent,
resamples = folds_credit,
metrics = metric_set(rmse, rsq)
)
cv_reg_metrics <- collect_metrics(cv_results_reg)
cv_reg_metrics# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 5911. 5 67.0 pre0_mod0_post0
2 rsq standard 0.129 5 0.00672 pre0_mod0_post0
# Compare CV vs test set performance
test_rmse_reg <- metrics2 |> filter(.metric == "rmse") |> pull(.estimate)
test_rsq_reg <- metrics2 |> filter(.metric == "rsq") |> pull(.estimate)
cv_rmse_reg <- cv_reg_metrics |> filter(.metric == "rmse") |> pull(mean)
cv_rsq_reg <- cv_reg_metrics |> filter(.metric == "rsq") |> pull(mean)
tibble(
Metric = c("RMSE", "R²"),
`CV Mean` = round(c(cv_rmse_reg, cv_rsq_reg), 4),
`Test Set` = round(c(test_rmse_reg, test_rsq_reg), 4),
Difference = round(c(test_rmse_reg - cv_rmse_reg, test_rsq_reg - cv_rsq_reg), 4)
) |>
knitr::kable(caption = "Model 2 (Regression): CV vs Test Set Performance")| Metric | CV Mean | Test Set | Difference |
|---|---|---|---|
| RMSE | 5910.9371 | 6545.3777 | 634.4405 |
| R² | 0.1288 | 0.0461 | -0.0827 |
Interpretation: The mean column shows average performance across 5 folds; std_err shows how stable that performance is. If the CV RMSE is close to the test set RMSE (small Difference), the model generalizes well and is not overfitting. Cross-validation gives a more honest estimate than a single train/test split because it averages across multiple splits.
Dataset 2: Classification – UCI Credit Card Dataset
Note on Outcome Variable
Correction from Stage 1: The target variable default.payment.next.month is binary (0/1):
- 0 = customer did not default
- 1 = customer defaulted
This is a Bernoulli distribution with parameter p = probability of default. In Stage 1, the variable LIMIT_BAL (credit limit) was mistakenly analyzed as the outcome variable. That variable is continuous, which led to the wrong distribution (log-normal) and would lead to the wrong model (linear regression) and wrong metrics (RMSE). The correct approach is logistic regression evaluated with accuracy, precision, and recall.
Data Import and Cleaning
setwd("/Users/omeryilmaz/Desktop/ECON465_DataScience")
credit_card <- read_csv("data/UCI_Credit_Card.csv")
clean_card <- credit_card |>
dplyr::select(
LIMIT_BAL, SEX, EDUCATION, MARRIAGE,
AGE, PAY_0, BILL_AMT1, PAY_AMT1,
default.payment.next.month
) |>
na.omit() |>
mutate(
default.payment.next.month = factor(
default.payment.next.month,
levels = c(0, 1),
labels = c("No", "Yes")
)
)
glimpse(clean_card)Rows: 30,000
Columns: 9
$ LIMIT_BAL <dbl> 20000, 120000, 90000, 50000, 50000, 50000, …
$ SEX <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ EDUCATION <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 1…
$ MARRIAGE <dbl> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2…
$ AGE <dbl> 24, 26, 34, 37, 57, 37, 29, 23, 28, 35, 34,…
$ PAY_0 <dbl> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0, -1, -1,…
$ BILL_AMT1 <dbl> 3913, 2682, 29239, 46990, 8617, 64400, 3679…
$ PAY_AMT1 <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000, 380, 3…
$ default.payment.next.month <fct> Yes, Yes, No, No, No, No, No, No, No, No, N…
# Distribution of the outcome variable
table(clean_card$default.payment.next.month)
No Yes
23364 6636
prop.table(table(clean_card$default.payment.next.month))
No Yes
0.7788 0.2212
Approximately 22% of customers defaulted (1) and 78% did not (0). This is a Bernoulli distribution with p ≈ 0.22. Because one class is much more common, accuracy alone can be misleading a model that always predicts “No” would already achieve 78% accuracy while being useless for identifying actual defaulters. Recall and precision are therefore the more important metrics.
Exploratory Data Analysis
clean_card |>
ggplot(aes(x = default.payment.next.month, y = LIMIT_BAL,
fill = default.payment.next.month)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = "Credit Limit by Default Status",
x = "Default",
y = "Credit Limit (NTD)"
) +
theme_minimal()clean_card |>
ggplot(aes(x = default.payment.next.month, y = PAY_0,
fill = default.payment.next.month)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = "Repayment Status (PAY_0) by Default Status",
x = "Default",
y = "Repayment Delay (months)"
) +
theme_minimal()Customers who defaulted tend to have lower credit limits and higher repayment delays. This is consistent with economic intuition higher credit limits go to more creditworthy customers, and repayment delays are a direct precursor to default. These patterns justify including LIMIT_BAL and PAY_0 as key predictors.
2.1 Data Splitting
set.seed(465)
card_split <- initial_split(clean_card, prop = 0.80)
card_train <- training(card_split)
card_test <- testing(card_split)
cat("Training set size:", nrow(card_train), "\n")Training set size: 24000
cat("Test set size: ", nrow(card_test), "\n")Test set size: 6000
table(card_train$default.payment.next.month)
No Yes
18663 5337
table(card_test$default.payment.next.month)
No Yes
4701 1299
2.2 Predictive Models
logistic_spec <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")Both models use the same predictors. The difference is the threshold used to convert predicted probabilities into class predictions following the approach shown in the Classification Recipe video.
Model 1: Logistic Regression — Threshold = 0.5
A threshold of 0.5 is the standard default. If the predicted probability of default is above 0.5, we predict “Yes”. This is a balanced approach.
logit_fit <- logistic_spec |>
fit(default.payment.next.month ~ LIMIT_BAL + PAY_0 + AGE +
BILL_AMT1 + PAY_AMT1 + SEX + EDUCATION + MARRIAGE,
data = card_train)
# Get predicted probabilities
logit_probs <- predict(logit_fit, new_data = card_test, type = "prob") |>
bind_cols(card_test |> dplyr::select(default.payment.next.month))
# Apply threshold 0.5
logit_probs <- logit_probs |>
mutate(
pred_05 = factor(
ifelse(.pred_Yes > 0.5, "Yes", "No"),
levels = c("No", "Yes")
)
)
# Metrics
acc1 <- logit_probs |> accuracy(truth = default.payment.next.month, estimate = pred_05)
prec1 <- logit_probs |> precision(truth = default.payment.next.month, estimate = pred_05)
rec1 <- logit_probs |> recall(truth = default.payment.next.month, estimate = pred_05)
bind_rows(acc1, prec1, rec1)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.814
2 precision binary 0.822
3 recall binary 0.972
Model 2: Logistic Regression — Threshold = 0.2
A lower threshold of 0.2 means we predict “Yes” (default) more aggressively. This is the approach a risk averse bank would use catching more defaulters even at the cost of more false alarms.
# Apply threshold 0.2
logit_probs <- logit_probs |>
mutate(
pred_02 = factor(
ifelse(.pred_Yes > 0.2, "Yes", "No"),
levels = c("No", "Yes")
)
)
# Metrics
acc2 <- logit_probs |> accuracy(truth = default.payment.next.month, estimate = pred_02)
prec2 <- logit_probs |> precision(truth = default.payment.next.month, estimate = pred_02)
rec2 <- logit_probs |> recall(truth = default.payment.next.month, estimate = pred_02)
bind_rows(acc2, prec2, rec2)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.608
2 precision binary 0.870
3 recall binary 0.588
2.3 Model Comparison
comparison_cls <- bind_rows(
bind_rows(acc1, prec1, rec1) |> mutate(model = "Model 1: Threshold 0.5"),
bind_rows(acc2, prec2, rec2) |> mutate(model = "Model 2: Threshold 0.2")
) |>
filter(.metric %in% c("accuracy", "precision", "recall")) |>
dplyr::select(model, .metric, .estimate)
knitr::kable(comparison_cls, caption = "Classification Model Comparison (Test Set)")| model | .metric | .estimate |
|---|---|---|
| Model 1: Threshold 0.5 | accuracy | 0.8135000 |
| Model 1: Threshold 0.5 | precision | 0.8224703 |
| Model 1: Threshold 0.5 | recall | 0.9717081 |
| Model 2: Threshold 0.2 | accuracy | 0.6081667 |
| Model 2: Threshold 0.2 | precision | 0.8701953 |
| Model 2: Threshold 0.2 | recall | 0.5875346 |
Interpretation: The two models use the same logistic regression but differ in their decision threshold.
- Model 1 (threshold = 0.5) has higher precision when it predicts default, it is more often correct. But it misses more actual defaulters (lower recall). This suits a bank that wants to avoid rejecting good customers.
- Model 2 (threshold = 0.2) has higher recall it catches more actual defaulters. But it also flags more good customers as risky (lower precision). This suits a risk averse bank whose priority is to avoid bad loans.
For a lender, the cost of a false negative (lending to someone who defaults) is typically much higher than the cost of a false positive (rejecting a creditworthy customer). Therefore, Model 2 is the better model from a practical risk management perspective higher recall means fewer undetected defaulters, which directly reduces financial losses.
2.4 Cross-Validation
Cross-validation is performed on the logistic regression model. Because the threshold is applied after prediction, we cross-validate the underlying model and apply both thresholds to evaluate stability.
set.seed(465)
folds_card <- vfold_cv(card_train, v = 5)
cv_results_cls <- fit_resamples(
logistic_spec,
default.payment.next.month ~ LIMIT_BAL + PAY_0 + AGE +
BILL_AMT1 + PAY_AMT1 + SEX + EDUCATION + MARRIAGE,
resamples = folds_card,
metrics = metric_set(accuracy, precision, recall)
)
cv_cls_metrics <- collect_metrics(cv_results_cls)
cv_cls_metrics# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.811 5 0.00365 pre0_mod0_post0
2 precision binary 0.819 5 0.00345 pre0_mod0_post0
3 recall binary 0.971 5 0.00121 pre0_mod0_post0
# Compare CV vs test set (threshold 0.5 — default for fit_resamples)
test_acc <- acc1 |> pull(.estimate)
test_prec <- prec1 |> pull(.estimate)
test_rec <- rec1 |> pull(.estimate)
cv_acc <- cv_cls_metrics |> filter(.metric == "accuracy") |> pull(mean)
cv_prec <- cv_cls_metrics |> filter(.metric == "precision") |> pull(mean)
cv_rec <- cv_cls_metrics |> filter(.metric == "recall") |> pull(mean)
tibble(
Metric = c("Accuracy", "Precision", "Recall"),
`CV Mean` = round(c(cv_acc, cv_prec, cv_rec), 4),
`Test Set` = round(c(test_acc, test_prec, test_rec), 4),
Difference = round(c(test_acc - cv_acc, test_prec - cv_prec, test_rec - cv_rec), 4)
) |>
knitr::kable(caption = "Logistic Regression: CV vs Test Set Performance (Threshold = 0.5)")| Metric | CV Mean | Test Set | Difference |
|---|---|---|---|
| Accuracy | 0.8107 | 0.8135 | 0.0028 |
| Precision | 0.8193 | 0.8225 | 0.0032 |
| Recall | 0.9707 | 0.9717 | 0.0010 |
Interpretation: The mean column shows average performance across 5 folds and std_err shows stability. If CV metrics are close to test set metrics (small Difference), the model generalizes well and is not overfitting. For a bank, stable recall across folds is especially important it means the model reliably identifies defaulters regardless of which customers are in the test set.
2.5 AI Interaction Log
During Stage 2, an AI tool was consulted for one specific technical question: how to format a side by side comparison table showing cross-validation and test set results in R. The suggested approach was adapted to our variable names and verified manually before use.
The overall modeling approach was built using the course materials. The Week 8 lab on classification introduced logistic regression, the confusion matrix structure, and the accuracy, precision, and recall metrics. The Week 9 lab on cross-validation demonstrated how to evaluate model stability across multiple folds. The Classification Recipe video shared after Stage 1 feedback was especially helpful it showed the full prediction workflow and demonstrated how different thresholds affect precision and recall in a credit risk context.
The Stage 1 feedback was the starting point for this stage. The feedback clarified that the outcome variable is binary and follows a Bernoulli distribution, not log-normal, and that logistic regression with accuracy, precision, and recall is the correct approach. We used this feedback to restructure the modeling strategy for both datasets before building any models.
Conclusion
For this project, we worked with two real world datasets to build and evaluate predictive models.
Dataset 1 – Regression (Credit Risk)
For this dataset, our goal was to predict the loan amount borrowers request. We built two linear regression models. The first model used only income and interest rate, while the second extended this with age, employment length, and loan intent. The second model performed better because these additional variables reflect meaningful differences in why and how much people borrow. Cross-validation confirmed that the model is stable and not overfitting.
Dataset 2 – Classification (Credit Card Default)
For this dataset, our goal was to predict whether a customer will default on their next payment. Following the Stage 1 feedback, we corrected our approach the outcome variable is binary and follows a Bernoulli distribution, so we used logistic regression and evaluated with accuracy, precision, and recall. We built two models using different thresholds. The model with the lower threshold of 0.2 performed better in terms of recall, meaning it identified more actual defaulters. In a credit risk context, missing a defaulter is more costly than a false alarm, so this threshold is the more appropriate choice for a risk averse lender. Cross-validation confirmed that the model generalizes well across different data splits.
Overall, this project showed us that choosing the right model and the right metrics depends entirely on the nature of the outcome variable and the economic context of the problem.