ECON 465 – Stage 3 Final Report: What Predicts Loan Amount?

Author

Efe Şahin - Ömer Faruk Yılmaz

Economic Question

What factors predict the loan amount a borrower requests?

Access to credit is a fundamental mechanism in modern economies: it enables households to invest in housing, education, and consumption smoothing. The key question here is not only whether we can predict loan size, but what borrower characteristics actually drive loan demand. If income alone dominates, lenders operate on a narrow repayment capacity model. If loan purpose and employment stability add independent predictive power, lenders are incorporating a richer assessment of borrower needs. Understanding this distinction matters for credit market efficiency: a lender that ignores loan purpose may systematically misprice or missize loan offers.

Data

Source and Variables

The dataset is the Credit Risk Dataset obtained from Kaggle. It contains real borrower-level information on loan applicants, including income, age, employment length, loan intent, interest rate, and loan amount. We chose this dataset because loan amount prediction is a directly relevant economic problem: understanding what drives borrowing demand matters for both lenders and policymakers. The dataset provides enough variation in loan intent, income, and borrower characteristics to meaningfully test whether loan purpose adds independent predictive power beyond standard financial indicators.

Source: Kaggle — Credit Risk Dataset

Observations: 28,638 after cleaning (32,581 rows in the raw file)

Target variable: loan_amnt — continuous, loan amount in USD

Predictors: annual income, interest rate, age, employment length, loan intent

Import and Cleaning

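library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(tidymodels)  # initial_split(), linear_reg(), metrics(), vfold_cv()

credit_data <- read_csv("credit_risk_dataset.csv")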

clean_credit <- credit_data |>
  select(
    person_age,
    person_income,
    person_emp_length,
    loan_intent,
    loan_int_rate,
    loan_amnt
  ) |>
  na.omit()

glimpse(clean_credit)
Rows: 28,638
Columns: 6
$ person_age        <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22, 21, 23, …
$ person_income     <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78956,…
$ person_emp_length <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4, 2, 7, 0,…
$ loan_intent       <chr> "PERSONAL", "EDUCATION", "MEDICAL", "MEDICAL", "MEDI…
$ loan_int_rate     <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, 11.1…
$ loan_amnt         <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35000,…
nrow(clean_credit)
[1] 28638

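Variables were selected and rows with missing values were removed with na.omit(), leaving 28,638 rows. The final dataset is tidy: each row is one loan application and each column is one variable. Note that na.omit() removes only missing values: implausible entries, such as the person_emp_length of 123 in the first row of the glimpse() output, remain in the data.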
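
The glimpse() output shows a person_emp_length of 123, which na.omit() keeps because the value is not missing. The following is a minimal sketch of a plausibility filter in base R. The toy data frame and both thresholds (age of at most 100, employment length no greater than age) are assumptions made for illustration; they are not rules taken from the dataset documentation, and this filter was not applied in the analysis.

```r
# Toy data standing in for clean_credit (values are illustrative only)
toy_credit <- data.frame(
  person_age        = c(22, 21, 25, 144),
  person_emp_length = c(123, 5, 1, 4),
  loan_amnt         = c(35000, 1000, 5500, 6000)
)

# Assumed plausibility rules: age <= 100 and employment length <= age
plausible <- toy_credit$person_age <= 100 &
  toy_credit$person_emp_length <= toy_credit$person_age

filtered_credit <- toy_credit[plausible, ]

# Report how many rows the rules drop before deciding whether to apply them
cat("Rows dropped:", sum(!plausible), "of", nrow(toy_credit), "\n")
```

Reporting the number of dropped rows first makes it easier to judge whether such a filter would change the sample materially.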

Probability Analysis

Summary Statistics

clean_credit |>
  summarise(
    N       = n(),
    Mean    = round(mean(loan_amnt), 0),
    Median  = round(median(loan_amnt), 0),
    SD      = round(sd(loan_amnt), 0),
    Min     = min(loan_amnt),
    Max     = max(loan_amnt)
  ) |>
  knitr::kable(caption = "Summary Statistics: Loan Amount (USD)")
Summary Statistics: Loan Amount (USD)
    N    Mean   Median     SD   Min     Max
28638    9656     8000   6330   500   35000

The mean loan amount (9,656 USD) is higher than the median (8,000 USD), indicating a right-skewed distribution. The standard deviation (6,330 USD) is about two thirds of the mean, reflecting considerable variation in loan sizes across borrowers.

Distribution of Loan Amount

The histogram below shows how loan amounts are distributed across all borrowers in the dataset. The shape of the distribution helps us identify the appropriate theoretical distribution and informs our modeling choices.

ggplot(clean_credit, aes(x = loan_amnt)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Loan Amount",
    x = "Loan Amount (USD)",
    y = "Count"
  ) +
  theme_minimal()

Log Transformation

To address the right-skewed distribution, we apply a log transformation to the loan amount variable. Log transformation is a standard technique for financial data: it compresses large values, reduces the effect of outliers, and often reveals a more symmetric, bell-shaped distribution.

clean_credit <- clean_credit |>
  mutate(log_loan = log(loan_amnt))

ggplot(clean_credit, aes(x = log_loan)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Log Loan Amount",
    x = "Log Loan Amount",
    y = "Count"
  ) +
  theme_minimal()

The original distribution is right-skewed: most borrowers request small loans, while few request very large amounts. After the log transformation, the histogram looks approximately normal. This suggests that loan amount is roughly log-normal, which is consistent with economic theory: financial quantities that are strictly positive and grow multiplicatively tend to be log-normally distributed.
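
The visual impression from the two histograms can be backed by a number. The sketch below computes sample skewness before and after the log transformation. It uses simulated log-normal draws as a stand-in for loan_amnt; the parameters meanlog = 9 and sdlog = 0.6 and the helper name skewness are assumptions for illustration, not estimates from the data.

```r
set.seed(465)

# Simulated stand-in for loan_amnt: log-normal draws (parameters are assumptions)
sim_loan <- rlnorm(5000, meanlog = 9, sdlog = 0.6)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_loan)       # clearly positive for right-skewed data
skew_log <- skewness(log(sim_loan))  # near zero if the logged values are symmetric

cat("Skewness (raw):", round(skew_raw, 2), "\n")
cat("Skewness (log):", round(skew_log, 2), "\n")
```

Applied to clean_credit, the same function could be called on loan_amnt and log_loan; a normal Q-Q plot of log_loan with qqnorm() and qqline() would be a complementary visual check.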

Loan Amount by Intent

clean_credit |>
  ggplot(aes(x = loan_intent, y = loan_amnt, fill = loan_intent)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Loan Amount by Loan Intent",
    x = "Loan Intent",
    y = "Loan Amount (USD)"
  ) +
  theme_minimal()

The boxplot reveals clear differences in loan amount across loan intent categories. Home improvement and debt consolidation loans tend to be larger than personal or medical loans. This pattern directly motivates including loan intent as a predictor in the models.
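
A boxplot summarizes each group by its median and interquartile range, so the same comparison can be read off a small table. This is a minimal base R sketch on made-up values; the toy data frame is an assumption and does not reproduce the figures in the dataset.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  loan_intent = c("MEDICAL", "MEDICAL", "PERSONAL", "PERSONAL",
                  "HOMEIMPROVEMENT", "HOMEIMPROVEMENT"),
  loan_amnt   = c(4000, 6000, 5000, 9000, 8000, 12000)
)

# Median and interquartile range of loan amount within each intent category
medians <- tapply(toy$loan_amnt, toy$loan_intent, median)
iqrs    <- tapply(toy$loan_amnt, toy$loan_intent, IQR)

intent_summary <- data.frame(
  intent = names(medians),
  median = as.numeric(medians),
  iqr    = as.numeric(iqrs)
)
intent_summary
```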

Modeling

Data Splitting

set.seed(465)

credit_split <- initial_split(clean_credit, prop = 0.80)
credit_train <- training(credit_split)
credit_test  <- testing(credit_split)

cat("Training set size:", nrow(credit_train), "\n")
Training set size: 22910 
cat("Test set size:    ", nrow(credit_test), "\n")
Test set size:     5728 

The dataset was split into 80% training and 20% test sets using initial_split() with set.seed(465) for reproducibility. The training set is used to estimate the model and the test set evaluates performance on unseen data. The log transformation was used for distribution analysis only; the models use the original loan_amnt variable as the outcome.
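
To make the split sizes concrete, the sketch below performs an analogous 80/20 split on row indices in base R. It is not the internals of initial_split() and will not select the same rows; it only shows why 28,638 rows yield 22,910 training and 5,728 test observations.

```r
set.seed(465)

n_rows  <- 28638                  # rows in clean_credit after na.omit()
train_n <- floor(0.80 * n_rows)   # 0.80 * 28638 = 22910.4, rounded down

# Draw training row indices without replacement; the rest form the test set
train_idx <- sample(n_rows, train_n)
test_idx  <- setdiff(seq_len(n_rows), train_idx)

cat("Training set size:", length(train_idx), "\n")
cat("Test set size:    ", length(test_idx), "\n")
```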

Model Specification

Both models use the same linear regression specification, defined once here using the tidymodels framework.

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

We use linear regression because the target variable is continuous. It models loan amount as a linear function of predictors and produces directly interpretable coefficients.

Model 1 — Core Financial Indicators

model1_fit <- lm_spec |>
  fit(loan_amnt ~ person_income + loan_int_rate, data = credit_train)

tidy(model1_fit)
# A tibble: 3 × 5
  term           estimate  std.error statistic   p.value
  <chr>             <dbl>      <dbl>     <dbl>     <dbl>
1 (Intercept)   3888.     148.            26.3 3.32e-150
2 person_income    0.0375   0.000736      51.0 0        
3 loan_int_rate  295.      12.1           24.4 2.24e-129

Rationale: This baseline model uses only the two most direct financial indicators: income and interest rate. Income captures the borrower’s ability to repay. Interest rate reflects the cost of borrowing and is also correlated with lender-assessed risk. This model tests whether standard financial metrics alone are sufficient to predict how much a borrower requests.

Predictions and Evaluation Metrics

pred1 <- predict(model1_fit, new_data = credit_test) |>
  bind_cols(credit_test |> select(loan_amnt))

metrics1 <- pred1 |>
  metrics(truth = loan_amnt, estimate = .pred)

metrics1
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   6604.    
2 rsq     standard      0.0409
3 mae     standard   4629.    
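
For readers who want to see what the three metrics measure, here is a minimal base R sketch on made-up numbers. The vectors truth and estimate are illustrative assumptions. The R-squared line uses the squared correlation between observed and predicted values, which is how yardstick's rsq() is defined; it can therefore differ from the traditional 1 - SSE/SST version.

```r
# Toy observed and predicted loan amounts (illustrative values only)
truth    <- c(5000, 12000, 8000, 20000, 3000)
estimate <- c(7000, 10000, 9000, 15000, 6000)

errors <- truth - estimate

rmse_manual <- sqrt(mean(errors^2))     # root mean squared error
mae_manual  <- mean(abs(errors))        # mean absolute error
rsq_manual  <- cor(truth, estimate)^2   # squared correlation

round(c(rmse = rmse_manual, mae = mae_manual, rsq = rsq_manual), 3)
```

Because RMSE squares the errors before averaging, a few very large misses raise it more than they raise MAE, which is one reason the two can diverge as they do in the output above.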

Model 2 — Extended Borrower Profile

model2_fit <- lm_spec |>
  fit(loan_amnt ~ person_income + loan_int_rate + person_age +
        person_emp_length + loan_intent, data = credit_train)

tidy(model2_fit)
# A tibble: 10 × 5
   term                        estimate  std.error statistic   p.value
   <chr>                          <dbl>      <dbl>     <dbl>     <dbl>
 1 (Intercept)                3531.     240.         14.7    1.34e- 48
 2 person_income                 0.0361   0.000749   48.2    0        
 3 loan_int_rate               302.      12.1        25.0    1.27e-135
 4 person_age                   -5.14     6.39       -0.805  4.21e-  1
 5 person_emp_length           109.       9.54       11.4    3.47e- 30
 6 loan_intentEDUCATION        -60.5    131.         -0.462  6.44e-  1
 7 loan_intentHOMEIMPROVEMENT  397.     153.          2.60   9.39e-  3
 8 loan_intentMEDICAL         -184.     133.         -1.38   1.67e-  1
 9 loan_intentPERSONAL          -8.57   136.         -0.0631 9.50e-  1
10 loan_intentVENTURE           -3.95   135.         -0.0292 9.77e-  1

Rationale: This extended model adds age, employment length, and loan intent. Age and employment length capture borrower stability and financial experience: older and more stably employed borrowers may request larger loans, reflecting greater creditworthiness. Loan intent captures the purpose of borrowing, which is the central variable of interest in this analysis. This model tests whether a richer borrower profile improves predictions beyond income and interest rate alone.

Predictions and Evaluation Metrics

pred2 <- predict(model2_fit, new_data = credit_test) |>
  bind_cols(credit_test |> select(loan_amnt))

metrics2 <- pred2 |>
  metrics(truth = loan_amnt, estimate = .pred)

metrics2
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   6545.    
2 rsq     standard      0.0461
3 mae     standard   4620.    

Model Comparison

rmse1 <- metrics1 |> filter(.metric == "rmse") |> pull(.estimate)
rsq1  <- metrics1 |> filter(.metric == "rsq")  |> pull(.estimate)
rmse2 <- metrics2 |> filter(.metric == "rmse") |> pull(.estimate)
rsq2  <- metrics2 |> filter(.metric == "rsq")  |> pull(.estimate)

tibble(
  Model = c("Model 1: Core Financial (2 vars)", "Model 2: Extended Profile (5 vars) — SELECTED"),
  RMSE  = round(c(rmse1, rmse2), 2),
  R2    = round(c(rsq1, rsq2), 4)
) |>
  knitr::kable(caption = "Model Comparison (Test Set)")
Model Comparison (Test Set)
Model                                           RMSE      R2
Model 1: Core Financial (2 vars)                6603.73   0.0409
Model 2: Extended Profile (5 vars) — SELECTED   6545.38   0.0461

Model Selection: Why Model 2?

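Model 2 is selected as the final model. On the test set it achieves a lower RMSE (6,545 vs. 6,604 USD) and a higher R-squared (0.0461 vs. 0.0409) than Model 1.

The logic is straightforward: if loan intent, age, and employment length carry genuine predictive information about loan size beyond income and interest rate, Model 2 should outperform Model 1 on the held-out test set. It does, although the gain is small: RMSE falls by about 58 USD, and the test-set R-squared stays below 0.05 for both models. Because the three predictors were added together, the comparison cannot attribute the gain to loan intent alone; in the Model 2 coefficient table, employment length and the home improvement category are the only added terms that are significant at the 5% level. The result is therefore consistent with the purpose of borrowing adding some independent signal, mainly for home improvement loans, rather than proof of it. A lender that prices or sizes loans based only on income may still mismatch offers for borrowers with specific purposes such as home improvement.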
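
A complementary in-sample check, which was not part of our workflow, is a nested-model F-test of whether the added predictors jointly improve the fit. The sketch below uses simulated data as a stand-in for credit_train; the sample size, the three intent levels, and every coefficient in the data-generating line are assumptions for illustration.

```r
set.seed(465)
n <- 500

# Simulated stand-in for credit_train; the coefficients are assumptions
sim <- data.frame(
  person_income     = runif(n, 20000, 120000),
  loan_int_rate     = runif(n, 5, 20),
  person_emp_length = rpois(n, 5),
  loan_intent       = sample(c("DEBTCONSOLIDATION", "HOMEIMPROVEMENT", "MEDICAL"),
                             n, replace = TRUE)
)
sim$loan_amnt <- 3500 + 0.04 * sim$person_income + 300 * sim$loan_int_rate +
  100 * sim$person_emp_length +
  400 * (sim$loan_intent == "HOMEIMPROVEMENT") + rnorm(n, sd = 3000)

base_fit     <- lm(loan_amnt ~ person_income + loan_int_rate, data = sim)
extended_fit <- lm(loan_amnt ~ person_income + loan_int_rate +
                     person_emp_length + loan_intent, data = sim)

# F-test of the null hypothesis that all added coefficients are zero
nested_test <- anova(base_fit, extended_fit)
nested_test
```

The Df column counts the added coefficients (one for employment length, two for the intent dummies in this toy setup), and the Pr(>F) column gives the joint p-value.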

Cross-Validation

To assess whether Model 2 generalizes well to new data, we perform 5-fold cross validation on the training set. This gives a more reliable estimate of model performance than a single train/test split because it averages results across five different validation sets.

set.seed(465)

folds_credit <- vfold_cv(credit_train, v = 5)

cv_results <- fit_resamples(
  lm_spec,
  loan_amnt ~ person_income + loan_int_rate + person_age +
    person_emp_length + loan_intent,
  resamples = folds_credit,
  metrics = metric_set(rmse, rsq)
)

cv_metrics <- collect_metrics(cv_results)
cv_metrics
# A tibble: 2 × 6
  .metric .estimator     mean     n  std_err .config        
  <chr>   <chr>         <dbl> <int>    <dbl> <chr>          
1 rmse    standard   5911.        5 67.0     pre0_mod0_post0
2 rsq     standard      0.129     5  0.00672 pre0_mod0_post0
cv_rmse <- cv_metrics |> filter(.metric == "rmse") |> pull(mean)
cv_rsq  <- cv_metrics |> filter(.metric == "rsq")  |> pull(mean)

tibble(
  Metric     = c("RMSE", "R2"),
  CV_Mean    = round(c(cv_rmse, cv_rsq), 4),
  Test_Set   = round(c(rmse2, rsq2), 4),
  Difference = round(c(rmse2 - cv_rmse, rsq2 - cv_rsq), 4)
) |>
  knitr::kable(caption = "Model 2: CV vs Test Set Performance")
Model 2: CV vs Test Set Performance
Metric   CV_Mean     Test_Set    Difference
RMSE     5910.9371   6545.3777   634.4405
R2       0.1288      0.0461      -0.0827

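The cross-validated and test-set values differ noticeably. The cross-validated RMSE (5,911 USD) is about 634 USD lower than the test-set RMSE (6,545 USD), and the cross-validated R-squared (0.129) is well above the test-set value (0.046). The std_err values in the CV output are small (67 USD for RMSE), showing consistent performance across the five folds, so the gap to the test set is far larger than fold-to-fold variation. Performance is therefore stable within the training data but does not carry over fully to the test set. One possible explanation is that the test set contains a few extreme observations; this should be checked before treating either figure as the model's expected error.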
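
To show what the mean and std_err columns summarize, the sketch below runs 5-fold cross-validation by hand in base R. The simulated data and the one-predictor model are assumptions for illustration, and the standard error is computed as the standard deviation of the fold RMSEs divided by the square root of the number of folds, which is how we read the std_err column.

```r
set.seed(465)
n <- 200

# Simulated data (assumed relationship, for illustration only)
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 + 3 * sim$x + rnorm(n, sd = 2)

# Randomly assign each row to one of 5 folds of equal size
fold_id <- sample(rep(1:5, length.out = n))

fold_rmse <- sapply(1:5, function(k) {
  fit  <- lm(y ~ x, data = sim[fold_id != k, ])        # fit on the other 4 folds
  pred <- predict(fit, newdata = sim[fold_id == k, ])  # predict the held-out fold
  sqrt(mean((sim$y[fold_id == k] - pred)^2))
})

cv_mean    <- mean(fold_rmse)
cv_std_err <- sd(fold_rmse) / sqrt(length(fold_rmse))  # standard error across folds

round(c(mean = cv_mean, std_err = cv_std_err), 3)
```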

Results

Economic Interpretation

tidy(model2_fit) |>
  filter(p.value < 0.05) |>
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE ~ ""
    )
  ) |>
  select(term, estimate, std.error, statistic, p.value, stars) |>
  knitr::kable(digits = 4, caption = "Model 2: Significant Coefficients")
Model 2: Significant Coefficients
term                         estimate    std.error   statistic   p.value   stars
(Intercept)                  3530.5984   240.4220    14.6850     0.0000    ***
person_income                0.0361      0.0007      48.2333     0.0000    ***
loan_int_rate                302.0300    12.1037     24.9534     0.0000    ***
person_emp_length            109.0334    9.5370      11.4327     0.0000    ***
loan_intentHOMEIMPROVEMENT   397.1937    152.9044    2.5977      0.0094    **

In linear regression, each coefficient represents the change in loan amount in USD for a one-unit increase in that predictor, holding all other variables constant.

person_income: The positive coefficient confirms that higher-earning borrowers request larger loans. This is consistent with lenders using income as a primary indicator of repayment capacity. The estimate of 0.0361 implies that each additional 1,000 USD of annual income is associated with about 36 USD more in loan amount, a small but statistically significant effect.

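loan_int_rate: The positive coefficient on interest rate may seem counterintuitive, since a higher price of credit would normally reduce demand. One interpretation is adverse selection in credit markets: riskier borrowers are charged higher rates and tend to request larger loans. The regression shows an association only, so it cannot rule out the reverse channel in which larger loans are priced at higher rates.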

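loan_intent: The effect of loan intent is concentrated in one category. Each intent coefficient is measured relative to the omitted reference category, debt consolidation. Home improvement borrowers request about 397 USD more than debt consolidation borrowers with the same income, interest rate, age, and employment length (p = 0.009). The education, medical, personal, and venture coefficients are not statistically distinguishable from zero. The purpose of borrowing therefore adds some information independent of income, but it is a weaker predictor than income, interest rate, or employment length.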

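person_age and person_emp_length: Only employment length has a significant effect. Each additional unit of employment length is associated with about 109 USD more in loan amount, consistent with more stably employed borrowers having greater creditworthiness. Age is not significant once the other predictors are included (p = 0.42).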
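
The coefficients are easier to judge when scaled to realistic changes. The sketch below does that arithmetic with the Model 2 estimates copied from the coefficient table. It assumes employment length is measured in years and the interest rate in percentage points, as the values in the glimpse() output suggest; the variable names are ours.

```r
# Model 2 estimates copied from the coefficient table
b_income   <- 0.0361   # USD of loan per USD of annual income
b_int_rate <- 302.03   # USD of loan per percentage point of interest rate (assumed unit)
b_emp_len  <- 109.03   # USD of loan per year of employment (assumed unit)
b_home_imp <- 397.19   # USD relative to the omitted intent category

# Implied differences in predicted loan amount, other predictors held fixed
effect_income_10k <- b_income * 10000   # +10,000 USD of annual income
effect_emp_5yr    <- b_emp_len * 5      # +5 years of employment

c(income_10k = effect_income_10k, emp_5yr = effect_emp_5yr,
  int_rate_1pt = b_int_rate, home_improvement = b_home_imp)
```

On this scale, a 10,000 USD difference in income and the home improvement category shift the predicted loan by amounts of similar size, both small next to the mean loan of 9,656 USD.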

Answering the Economic Question

The original question was: what actually drives loan demand?

The answer is that it is not just income, although income remains the dominant factor. In Model 2, income, interest rate, and employment length are all strongly significant. Loan intent adds a smaller effect: borrowers seeking home improvement loans request about 397 USD more than debt consolidation borrowers with the same income and other characteristics, while the remaining intent categories do not differ significantly. Age has no significant effect. The key finding is that purpose matters independently for at least one category, but overall predictive power is low (test-set R-squared of about 0.05).

This has a practical implication: a lender that looks only at income may offer the wrong loan size to borrowers with specific purposes, home improvement in particular. Incorporating loan intent into credit models may lead to modestly more accurate and better-matched lending decisions.

Policy Implications

If loan intent independently predicts loan size beyond income and interest rate, then lenders that ignore it may systematically under- or over-lend to certain borrower segments. Policymakers and financial institutions could consider:

  1. Purpose-adjusted loan sizing — offering different loan products calibrated to the typical size needs of different intent categories.
  2. Employment-length thresholds — using employment stability as a secondary screening criterion for larger loan requests.
  3. Income-intent interaction — investigating whether the income-loan size relationship differs across intent categories, which could reveal segments where standard income-based models are insufficient.

Limitations and Reproducibility

Limitations

Limitation 1 — Missing credit score: The dataset does not include credit score information, which is one of the most important determinants of loan terms in practice. Its absence likely limits the predictive power of our models and may introduce omitted variable bias: if borrowers with higher credit scores systematically request larger loans and also tend to have higher incomes, part of that effect is absorbed by the income coefficient.
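
Omitted variable bias can be illustrated with a short simulation. In the sketch below, all numbers are assumptions chosen for illustration: income is in thousands of USD, credit score is built to correlate positively with income, and both raise loan size. Leaving credit score out of the regression then inflates the income coefficient.

```r
set.seed(465)
n <- 5000

# Simulated borrowers: credit score is positively correlated with income
income       <- rnorm(n, mean = 60, sd = 20)   # thousands of USD (assumed)
credit_score <- 600 + 1.5 * (income - 60) + rnorm(n, sd = 40)

# Assumed true relationship: both income and credit score raise loan size
loan <- 2 + 0.10 * income + 0.02 * credit_score + rnorm(n, sd = 3)

full_fit    <- lm(loan ~ income + credit_score)
omitted_fit <- lm(loan ~ income)   # credit score left out

# Omitting credit score pushes the income coefficient above its true value of 0.10
c(full = coef(full_fit)[["income"]], omitted = coef(omitted_fit)[["income"]])
```

In this setup the expected size of the bias is the credit score coefficient times the slope of credit score on income, 0.02 * 1.5 = 0.03, so the short regression should estimate roughly 0.13 rather than 0.10.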

Limitation 2 — Cross-sectional data: The dataset covers a single point in time and cannot capture how borrowing behavior changes over the economic cycle. During recessions, borrowers may request smaller loans or lenders may tighten criteria, and these dynamics would not be reflected in this data.

Reproducibility

All random processes use set.seed(465) to ensure identical results on every render. All file paths are relative: credit_risk_dataset.csv must be in the same directory as the .qmd file. The analysis uses standard CRAN packages: tidyverse and tidymodels. The document can be rendered with the RStudio Render button or quarto::quarto_render().

AI Use Log

During this project, an AI tool was consulted alongside the course materials for certain parts of the writing and code. The Week 7, 8, and 9 labs provided the core structure for the regression and cross-validation workflow. The AI was used to clarify specific implementation questions where the lab materials did not cover the exact use case, and all AI-suggested approaches were verified against the course content before use.

Final Reflections

Interaction 1 — Stage 1 Correction

Prompt given: “I have a binary outcome variable (default.payment.next.month) in my credit card dataset. My Stage 1 report analyzed the wrong variable. I used LIMIT_BAL instead of the actual target. My professor said the outcome should be Bernoulli, not log-normal. Can you explain what Bernoulli distribution is and how I should correctly analyze a binary outcome in R for a classification task?”

How the output was used: The AI explained that a binary variable follows a Bernoulli distribution and that logistic regression is the appropriate model. This was used to correct the Stage 2 classification approach. The explanation was verified against the Week 8 and Week 9 course labs before implementation.

Verification: The explanation was cross-checked against the Week 8 lab on classification and confirmed to be consistent with what was covered in class before applying it to the project.

Interaction 2 — CV Comparison Table

Prompt given: “I ran 5 fold cross validation using fit_resamples() and collected metrics with collect_metrics(). I also have test set metrics in a tibble. I want a single side by side table showing, for each metric, the CV mean and the test set value in the same row with a Difference column.”

How the output was used: The filter() and pull() approach suggested by the AI was adopted as it was cleaner than pivot_wider(). The code was adapted to our variable names and the output was manually verified before writing interpretations.

Verification: Each individual metric value was checked manually against the raw output of collect_metrics() and the metrics() tibble before including them in the comparison table.

Suggested Improvement

The most valuable improvement would be to include credit score as a predictor. It is one of the most important variables in real-world lending decisions, and its absence is the biggest limitation of our current model. Adding it would likely improve both RMSE and R-squared, although we cannot say by how much without the data.

New Economic Question

This analysis has inspired a follow-up question: Does the relationship between income and loan amount differ across loan intent categories? In other words, does income predict loan size differently for education loans compared to home improvement loans? Answering this with interaction terms could reveal whether lenders should apply different income-based criteria depending on the purpose of the loan, a finding with direct implications for product design and credit market efficiency.
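
A minimal sketch of how such an interaction model could be specified is given below. It uses simulated data in which the income slope is built to differ between two intent categories; the sample size, slopes, and noise level are assumptions for illustration and say nothing about the real dataset.

```r
set.seed(465)
n <- 2000

# Simulated data in which the income slope differs by intent (assumed values)
sim <- data.frame(
  person_income = runif(n, 20000, 120000),
  loan_intent   = factor(sample(c("EDUCATION", "HOMEIMPROVEMENT"), n, replace = TRUE))
)
slope <- ifelse(sim$loan_intent == "HOMEIMPROVEMENT", 0.06, 0.03)
sim$loan_amnt <- 3000 + slope * sim$person_income + rnorm(n, sd = 1000)

# person_income * loan_intent expands to both main effects plus their interaction
interaction_fit <- lm(loan_amnt ~ person_income * loan_intent, data = sim)

# The interaction coefficient estimates the difference in income slopes
round(coef(interaction_fit), 4)
```

The coefficient named person_income:loan_intentHOMEIMPROVEMENT is the extra income slope for home improvement loans relative to education loans; a value near zero on real data would indicate that one income slope fits both categories.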

Conclusion

This project applied the complete data science pipeline to a real-world credit market dataset. The key findings are:

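  1. Loan amount can be predicted only weakly with the five variables used (income, interest rate, age, employment length, and loan intent): the test-set R-squared is about 0.05.
  2. A broader borrower profile including loan intent modestly outperforms the two-variable baseline of income and interest rate (test RMSE of 6,545 vs. 6,604 USD), so the added predictors carry some independent signal.
  3. Income, interest rate, and employment length are the strongest predictors. Among the intent categories, only home improvement differs significantly from the debt consolidation baseline. Lenders that incorporate loan purpose into their models may make somewhat more precise lending decisions.
  4. Results are consistent across the five cross-validation folds, but the cross-validated RMSE (5,911 USD) is lower than the test-set RMSE (6,545 USD), so measured performance is sensitive to the split. The analysis is fully reproducible with set.seed(465).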