ECON 465 – Stage 3 Final Report: What Predicts Loan Amount?

Authors

Efe Şahin - Ömer Faruk Yılmaz

Economic Question

What factors predict the loan amount a borrower requests?

Access to credit is a fundamental mechanism in modern economies: it enables households to invest in housing, education, and consumption smoothing. The key question here is not only whether we can predict loan size, but what borrower characteristics actually drive loan demand. If income alone dominates, lenders operate on a narrow repayment-capacity model. If loan purpose and employment stability add independent predictive power, lenders are incorporating a richer assessment of borrower needs. Understanding this distinction matters for credit market efficiency: a lender that ignores loan purpose may systematically misprice or missize loan offers.

Data

Source and Variables

The dataset is the Credit Risk Dataset obtained from Kaggle. It contains real borrower-level information on loan applicants including income, age, employment length, loan intent, interest rate, and loan amount. We chose this dataset because loan amount prediction is a directly relevant economic problem: understanding what drives borrowing demand matters for both lenders and policymakers. The dataset provides enough variation in loan intent, income, and borrower characteristics to meaningfully test whether loan purpose adds independent predictive power beyond standard financial indicators.

Source: Kaggle — Credit Risk Dataset

Observations: 28,638 after cleaning (32,581 before rows with missing values were removed)

Target variable: loan_amnt — continuous, loan amount in USD

Predictors: annual income, interest rate, age, employment length, loan intent, loan grade, home ownership, loan percent income, credit history length

Import and Cleaning

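library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(tidymodels)  # rsample, parsnip, yardstick, tune

credit_data <- read_csv("credit_risk_dataset.csv")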

clean_credit <- credit_data |>
  select(
    person_age,
    person_income,
    person_emp_length,
    person_home_ownership,
    loan_intent,
    loan_grade,
    loan_int_rate,
    loan_amnt,
    loan_percent_income,
    cb_person_cred_hist_length
  ) |>
  na.omit()

glimpse(clean_credit)
Rows: 28,638
Columns: 10
$ person_age                 <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22,…
$ person_income              <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 7710…
$ person_emp_length          <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4,…
$ person_home_ownership      <chr> "RENT", "OWN", "MORTGAGE", "RENT", "RENT", …
$ loan_intent                <chr> "PERSONAL", "EDUCATION", "MEDICAL", "MEDICA…
$ loan_grade                 <chr> "D", "B", "C", "C", "C", "A", "B", "B", "A"…
$ loan_int_rate              <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12…
$ loan_amnt                  <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 3500…
$ loan_percent_income        <dbl> 0.59, 0.10, 0.57, 0.53, 0.55, 0.25, 0.45, 0…
$ cb_person_cred_hist_length <dbl> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, 4, 4…
nrow(clean_credit)
[1] 28638

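Variables were selected and rows with missing values were removed with na.omit(), leaving 28,638 complete observations. The final dataset is tidy: each row is one loan application and each column is one variable. Note that na.omit() does not screen for implausible values: the first row in the output above shows an employment length of 123 years, and such records remain in the data.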
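As a rough sketch of how the cleaning step could be audited, the code below counts missing values per column, reports how many rows na.omit() drops, and applies a plausibility filter. The toy data frame and the thresholds (age at most 100, employment length at most 60) are assumptions made for illustration; they are not part of the original analysis.

```r
# Toy data standing in for credit_data (illustrative values, not the real file)
toy <- data.frame(
  person_age        = c(22, 21, 25, 144, 30),
  person_emp_length = c(123, 5, NA, 4, 8),
  loan_int_rate     = c(16.02, NA, 12.87, 15.23, 14.27),
  loan_amnt         = c(35000, 1000, 5500, 35000, 2500)
)

# Missing values per column, checked before any rows are dropped
na_counts <- colSums(is.na(toy))

# Rows kept after removing incomplete cases, and how many were lost
toy_complete <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_complete)

# Hypothetical plausibility rule (thresholds are assumptions)
toy_plausible <- subset(toy_complete, person_age <= 100 & person_emp_length <= 60)

cat("Rows dropped by na.omit():", rows_dropped, "\n")
cat("Rows remaining after plausibility filter:", nrow(toy_plausible), "\n")
```

Whether to remove or cap implausible records is a judgment call; reporting the counts at least makes the choice visible.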

Probability Analysis

Summary Statistics

clean_credit |>
  summarise(
    N       = n(),
    Mean    = round(mean(loan_amnt), 0),
    Median  = round(median(loan_amnt), 0),
    SD      = round(sd(loan_amnt), 0),
    Min     = min(loan_amnt),
    Max     = max(loan_amnt)
  ) |>
  knitr::kable(caption = "Summary Statistics: Loan Amount (USD)")
Summary Statistics: Loan Amount (USD)
N Mean Median SD Min Max
28638 9656 8000 6330 500 35000

The mean loan amount is higher than the median, indicating a right-skewed distribution. The standard deviation is large relative to the mean, reflecting considerable variation in loan sizes across borrowers.

Distribution of Loan Amount

The histogram below shows how loan amounts are distributed across all borrowers in the dataset. The shape of the distribution helps us identify the appropriate theoretical distribution and informs our modeling choices.

ggplot(clean_credit, aes(x = loan_amnt)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Loan Amount",
    x = "Loan Amount (USD)",
    y = "Count"
  ) +
  theme_minimal()

Log Transformation

To address the right-skewed distribution, we apply a log transformation to the loan amount variable. Log transformation is a standard technique for financial data: it compresses large values, reduces the effect of outliers, and often reveals a more symmetric, bell-shaped distribution.

clean_credit <- clean_credit |>
  mutate(log_loan = log(loan_amnt))

ggplot(clean_credit, aes(x = log_loan)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Log Loan Amount",
    x = "Log Loan Amount",
    y = "Count"
  ) +
  theme_minimal()

The original distribution is right-skewed: most borrowers request small loans while few request very large amounts. After applying a log transformation, the distribution becomes approximately normal. This suggests that loan amount is approximately log-normally distributed, which is consistent with economic theory: financial quantities that are strictly positive and grow multiplicatively tend to be log-normally distributed.
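The visual impression from the two histograms can be backed by a number. The sketch below uses simulated log-normal draws (the parameters are assumptions, not estimates from the loan data) and a simple moment-based skewness measure to show how the log transformation moves skewness toward zero. The same skewness() helper could be applied to loan_amnt and log_loan.

```r
set.seed(465)

# Simulated stand-in for loan_amnt: log-normal draws (parameters are assumptions)
sim_loan <- rlnorm(10000, meanlog = 9, sdlog = 0.6)

# Moment-based sample skewness: near 0 if symmetric, positive for a long right tail
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_loan)
skew_log <- skewness(log(sim_loan))

cat("Skewness, raw scale:", round(skew_raw, 2), "\n")
cat("Skewness, log scale:", round(skew_log, 2), "\n")
```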

Loan Amount by Intent

clean_credit |>
  ggplot(aes(x = loan_intent, y = loan_amnt, fill = loan_intent)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Loan Amount by Loan Intent",
    x = "Loan Intent",
    y = "Loan Amount (USD)"
  ) +
  theme_minimal()

The boxplot reveals clear differences in loan amount across loan intent categories. Home improvement and debt consolidation loans tend to be larger than personal or medical loans. This pattern directly motivates including loan intent as a predictor in the models.

Modeling

Data Splitting

set.seed(465)

credit_split <- initial_split(clean_credit, prop = 0.80)
credit_train <- training(credit_split)
credit_test  <- testing(credit_split)

cat("Training set size:", nrow(credit_train), "\n")
Training set size: 22910 
cat("Test set size:    ", nrow(credit_test), "\n")
Test set size:     5728 

The dataset was split into 80% training and 20% test sets using initial_split() with set.seed(465) for reproducibility. The training set is used to estimate the model and the test set evaluates performance on unseen data. The log transformation was used for distribution analysis only; the models use the original loan_amnt variable as the outcome.

Model Specification

Both models use the same linear regression specification, defined once here using the tidymodels framework.

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

We use linear regression because the target variable is continuous. It models loan amount as a linear function of predictors and produces directly interpretable coefficients.

Model 1 — Core Financial Indicators

model1_fit <- lm_spec |>
  fit(loan_amnt ~ person_income + loan_int_rate, data = credit_train)

tidy(model1_fit)
# A tibble: 3 × 5
  term           estimate  std.error statistic   p.value
  <chr>             <dbl>      <dbl>     <dbl>     <dbl>
1 (Intercept)   3888.     148.            26.3 3.32e-150
2 person_income    0.0375   0.000736      51.0 0        
3 loan_int_rate  295.      12.1           24.4 2.24e-129

Rationale: This baseline model uses only the two most direct financial indicators: income and interest rate. Income captures the borrower’s ability to repay. Interest rate reflects the cost of borrowing and is also correlated with lender-assessed risk. This model tests whether standard financial metrics alone are sufficient to predict how much a borrower requests.

Predictions and Evaluation Metrics

pred1 <- predict(model1_fit, new_data = credit_test) |>
  bind_cols(credit_test |> select(loan_amnt))

metrics1 <- pred1 |>
  metrics(truth = loan_amnt, estimate = .pred)

metrics1
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   6604.    
2 rsq     standard      0.0409
3 mae     standard   4629.    
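For readers who want to see what the three metrics measure, here is a minimal base R sketch on toy numbers (the values are illustrative only). It follows the usual definitions; note that yardstick's rsq is the squared correlation between observed and predicted values rather than 1 minus the ratio of residual to total sum of squares.

```r
# Toy observed and predicted loan amounts (illustrative values only)
truth <- c(5000, 8000, 12000, 20000, 35000)
pred  <- c(6000, 7500, 11000, 16000, 25000)

err <- truth - pred

rmse_val <- sqrt(mean(err^2))    # squares errors, so large misses weigh more
mae_val  <- mean(abs(err))       # average absolute miss, in USD
rsq_val  <- cor(truth, pred)^2   # squared correlation of truth and prediction

cat("RMSE:", round(rmse_val, 1), "\n")
cat("MAE: ", round(mae_val, 1), "\n")
cat("R2:  ", round(rsq_val, 3), "\n")
```

Because RMSE squares each error, it is always at least as large as MAE, and the gap widens when a few predictions miss badly.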

Model 2 — Extended Borrower Profile

model2_fit <- lm_spec |>
  fit(loan_amnt ~ person_income + loan_int_rate + person_age +
        person_emp_length + loan_intent + loan_grade +
        person_home_ownership + loan_percent_income +
        cb_person_cred_hist_length, data = credit_train)

tidy(model2_fit)
# A tibble: 21 × 5
   term                         estimate  std.error statistic  p.value
   <chr>                           <dbl>      <dbl>     <dbl>    <dbl>
 1 (Intercept)                -3200.     271.         -11.8   4.78e-32
 2 person_income                  0.0585   0.000532   110.    0       
 3 loan_int_rate                170.      26.0          6.54  6.23e-11
 4 person_age                    14.2      8.12         1.75  8.06e- 2
 5 person_emp_length             78.8      6.50        12.1   9.09e-34
 6 loan_intentEDUCATION         170.      87.7          1.93  5.32e- 2
 7 loan_intentHOMEIMPROVEMENT   494.     102.           4.83  1.38e- 6
 8 loan_intentMEDICAL            42.8     89.3          0.479 6.32e- 1
 9 loan_intentPERSONAL          115.      90.8          1.26  2.06e- 1
10 loan_intentVENTURE           128.      91.0          1.41  1.58e- 1
# ℹ 11 more rows

Rationale: This extended model adds age, employment length, loan intent, loan grade, home ownership, loan percent income, and credit history length. Loan grade captures the lender-assigned risk rating. Loan percent income reflects how large the loan is relative to borrower income, a direct measure of debt burden. Credit history length proxies for borrower experience. Together these variables provide a much richer picture of the borrower profile and are expected to substantially improve predictive performance.
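Since loan percent income is described as the loan relative to income, it is worth checking how closely it tracks loan_amnt / person_income, because a predictor built from the outcome can inflate apparent fit. The sketch below uses the first six rows printed by glimpse() earlier; the 0.006 tolerance for two-decimal rounding is an assumption.

```r
# First six rows as printed in the glimpse(clean_credit) output
person_income       <- c(59000, 9600, 9600, 65500, 54400, 9900)
loan_amnt           <- c(35000, 1000, 5500, 35000, 35000, 2500)
loan_percent_income <- c(0.59, 0.10, 0.57, 0.53, 0.55, 0.25)

# Ratio implied by the other two columns
implied_ratio <- loan_amnt / person_income

# Tolerance allows for two-decimal rounding or truncation (an assumption)
matches <- abs(implied_ratio - loan_percent_income) < 0.006

data.frame(loan_percent_income, implied_ratio = round(implied_ratio, 3), matches)
```

In these six rows the implied ratio reproduces the recorded value in every case except the fifth. Running the same comparison on the full data would show how mechanical the relationship is.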

Predictions and Evaluation Metrics

pred2 <- predict(model2_fit, new_data = credit_test) |>
  bind_cols(credit_test |> select(loan_amnt))

metrics2 <- pred2 |>
  metrics(truth = loan_amnt, estimate = .pred)

metrics2
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard    5823.   
2 rsq     standard       0.347
3 mae     standard    2513.   

Model Comparison

rmse1 <- metrics1 |> filter(.metric == "rmse") |> pull(.estimate)
rsq1  <- metrics1 |> filter(.metric == "rsq")  |> pull(.estimate)
rmse2 <- metrics2 |> filter(.metric == "rmse") |> pull(.estimate)
rsq2  <- metrics2 |> filter(.metric == "rsq")  |> pull(.estimate)

tibble(
  Model = c("Model 1: Core Financial (2 vars)", "Model 2: Extended Profile (9 vars) — SELECTED"),
  RMSE  = round(c(rmse1, rmse2), 2),
  R2    = round(c(rsq1, rsq2), 4)
) |>
  knitr::kable(caption = "Model Comparison (Test Set)")
Model Comparison (Test Set)
Model RMSE R2
Model 1: Core Financial (2 vars) 6603.73 0.0409
Model 2: Extended Profile (9 vars) — SELECTED 5822.53 0.3468

Model Selection: Why Model 2?

Model 2 is selected as the final model. It achieves substantially lower RMSE and much higher R-squared on the test set compared to Model 1.

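The improvement is substantial: test-set RMSE falls from 6,604 to 5,823 and R-squared rises from 0.041 to 0.347. By adding loan grade, loan percent income, home ownership, and credit history length alongside loan intent, age, and employment length, the model captures a far richer picture of borrower characteristics. Loan percent income in particular is a strong predictor: it directly measures how much of a borrower’s income the loan represents, which is a key determinant of loan size. One caveat is that loan percent income is defined relative to the loan amount, so it is constructed partly from the outcome itself and some of the improvement may be mechanical rather than behavioral. Loan grade adds the lender’s own risk assessment as a signal.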

This suggests that the broader borrower profile, including the purpose of borrowing, adds genuine predictive power beyond income and interest rate alone. A lender that prices or sizes loans based only on income may systematically mismatch offers for borrowers with specific purposes such as home improvement.

Cross-Validation

To assess whether Model 2 generalizes well to new data, we perform 5-fold cross-validation on the training set. This gives a more reliable estimate of model performance than a single train/test split because it averages results across five different validation sets.
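To make the averaging idea concrete, here is a minimal base R sketch of 5-fold cross-validation written out by hand. The data are simulated and the one-predictor model is a placeholder (both are assumptions); the tidymodels code that follows performs the same kind of procedure on credit_train.

```r
set.seed(465)

# Simulated training data (an assumption; stands in for credit_train)
n <- 500
sim <- data.frame(income = runif(n, 20000, 100000))
sim$loan <- 2000 + 0.05 * sim$income + rnorm(n, sd = 3000)

k <- 5
fold_id <- sample(rep(1:k, length.out = n))  # randomly assign each row to a fold

fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(loan ~ income, data = sim[fold_id != i, ])   # fit on the other k-1 folds
  pred <- predict(fit, newdata = sim[fold_id == i, ])     # predict the held-out fold
  sqrt(mean((sim$loan[fold_id == i] - pred)^2))           # RMSE on the held-out fold
})

cv_mean   <- mean(fold_rmse)            # average RMSE across folds
cv_stderr <- sd(fold_rmse) / sqrt(k)    # standard error of that average

cat("Fold RMSEs:", round(fold_rmse), "\n")
cat("CV mean RMSE:", round(cv_mean), " std. error:", round(cv_stderr), "\n")
```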

set.seed(465)

folds_credit <- vfold_cv(credit_train, v = 5)

cv_results <- fit_resamples(
  lm_spec,
  loan_amnt ~ person_income + loan_int_rate + person_age +
    person_emp_length + loan_intent + loan_grade +
    person_home_ownership + loan_percent_income +
    cb_person_cred_hist_length,
  resamples = folds_credit,
  metrics = metric_set(rmse, rsq)
)

cv_metrics <- collect_metrics(cv_results)
cv_metrics
# A tibble: 2 × 6
  .metric .estimator     mean     n  std_err .config        
  <chr>   <chr>         <dbl> <int>    <dbl> <chr>          
1 rmse    standard   3954.        5 106.     pre0_mod0_post0
2 rsq     standard      0.611     5   0.0158 pre0_mod0_post0
cv_rmse <- cv_metrics |> filter(.metric == "rmse") |> pull(mean)
cv_rsq  <- cv_metrics |> filter(.metric == "rsq")  |> pull(mean)

tibble(
  Metric     = c("RMSE", "R2"),
  CV_Mean    = round(c(cv_rmse, cv_rsq), 4),
  Test_Set   = round(c(rmse2, rsq2), 4),
  Difference = round(c(rmse2 - cv_rmse, rsq2 - cv_rsq), 4)
) |>
  knitr::kable(caption = "Model 2: CV vs Test Set Performance")
Model 2: CV vs Test Set Performance
Metric CV_Mean Test_Set Difference
RMSE 3954.2759 5822.5308 1868.2548
R2 0.6113 0.3468 -0.2646

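The cross-validated and test-set metrics are not close. The test-set RMSE (5,823) is about 47% higher than the CV mean (3,954), and the test-set R-squared (0.347) is well below the CV mean (0.611). The std_err values in the CV output are small (106 for RMSE, 0.016 for R-squared), so performance is consistent across the five training folds, but the test-set result falls far outside that range. One plausible explanation, which we have not verified, is that the test split contains a few extreme observations: the test-set MAE (2,513) is less than half the test-set RMSE, a pattern typical of a small number of very large errors. The test-set figures should therefore be treated as the more conservative estimate of out-of-sample performance.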
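One way to probe a gap between CV and test-set RMSE is to check how sensitive each metric is to a handful of extreme errors. The sketch below uses simulated prediction errors (an assumption, not the report's residuals) and shows that a few very large misses move RMSE far more than MAE.

```r
set.seed(465)

# Simulated prediction errors in USD (an assumption, not the report's residuals)
err <- rnorm(1000, mean = 0, sd = 3000)

rmse <- function(e) sqrt(mean(e^2))
mae  <- function(e) mean(abs(e))

# Append three very large errors, as extreme incomes or loan sizes might produce
err_outliers <- c(err, rep(150000, 3))

comparison <- data.frame(
  case = c("no extreme errors", "3 extreme errors"),
  rmse = c(rmse(err), rmse(err_outliers)),
  mae  = c(mae(err), mae(err_outliers))
)
comparison
```

Applied to the real test set, sorting the absolute residuals of pred2 and inspecting the largest few would indicate whether a similar mechanism is at work.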

Results

Economic Interpretation

tidy(model2_fit) |>
  filter(p.value < 0.05) |>
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE ~ ""
    )
  ) |>
  select(term, estimate, std.error, statistic, p.value, stars)
# A tibble: 12 × 6
   term                         estimate  std.error statistic   p.value stars
   <chr>                           <dbl>      <dbl>     <dbl>     <dbl> <chr>
 1 (Intercept)                -3200.     271.          -11.8  4.78e- 32 ***  
 2 person_income                  0.0585   0.000532    110.   0         ***  
 3 loan_int_rate                170.      26.0           6.54 6.23e- 11 ***  
 4 person_emp_length             78.8      6.50         12.1  9.09e- 34 ***  
 5 loan_intentHOMEIMPROVEMENT   494.     102.            4.83 1.38e-  6 ***  
 6 loan_gradeC                 -653.     175.           -3.74 1.82e-  4 ***  
 7 loan_gradeF                 1466.     416.            3.52 4.31e-  4 ***  
 8 loan_gradeG                 2668.     653.            4.09 4.35e-  5 ***  
 9 person_home_ownershipOWN   -1763.     104.          -16.9  1.05e- 63 ***  
10 person_home_ownershipRENT  -1475.      58.2         -25.4  5.08e-140 ***  
11 loan_percent_income        43031.     259.          166.   0         ***  
12 cb_person_cred_hist_length   -31.1     12.6          -2.47 1.36e-  2 *    

In linear regression, each coefficient represents the change in loan amount in USD for a one unit increase in that predictor, holding all other variables constant.
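A short worked example can make the coefficient scale easier to read. The estimates below are copied from the Model 2 output above; the step sizes (an extra $10,000 of income, five more years of employment, a 0.01 increase in loan percent income) are assumptions chosen for readability.

```r
# Estimates copied from the tidy(model2_fit) output
b_income  <- 0.0585   # USD of loan per extra USD of annual income
b_emp     <- 78.8     # USD of loan per extra year of employment
b_pct_inc <- 43031    # USD of loan per one-unit change in loan_percent_income

# Illustrative changes, each holding the other predictors constant
effect_income_10k <- b_income * 10000   # +$10,000 annual income
effect_emp_5y     <- b_emp * 5          # +5 years of employment
effect_pct_1pt    <- b_pct_inc * 0.01   # +0.01 in loan_percent_income

cat("Extra $10,000 income:        ", effect_income_10k, "USD\n")
cat("Extra 5 years employment:    ", effect_emp_5y, "USD\n")
cat("+0.01 loan percent income:   ", effect_pct_1pt, "USD\n")
```

Rescaling in this way matters for loan percent income in particular, since the variable ranges between 0 and 1 and a full one-unit change is not a realistic comparison.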

person_income: The positive coefficient confirms that higher-earning borrowers request larger loans. This is consistent with lenders using income as a primary indicator of repayment capacity. Each additional dollar of annual income is associated with about $0.06 more in loan amount (roughly $585 per additional $10,000 of income), holding the other predictors constant.

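loan_int_rate: The positive coefficient on interest rate may seem counterintuitive, but it is consistent with adverse selection in credit markets: riskier borrowers are charged higher rates and tend to request larger loans. Each additional percentage point of interest is associated with about $170 more in loan amount.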

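loan_intent: Loan intent has a modest effect once the other predictors are held constant. The omitted reference category is debt consolidation, and only home improvement differs from it significantly: home improvement borrowers request about $494 more on average (p < 0.001). The coefficients for education, medical, personal, and venture loans are small and not significant at the 5% level. Purpose therefore carries some independent information, but far less than income or loan percent income.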

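person_age and person_emp_length: Employment length has a modest but significant effect: each additional year of employment is associated with about $79 more in loan amount, which may reflect greater financial stability. Age is not significant at the 5% level (p = 0.081), which is why it does not appear in the table above.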

Answering the Economic Question

The original question was: what actually drives loan demand?

The answer is that it is not just income. Model 2 shows that the broader borrower profile matters: loan percent income, home ownership, employment length, loan grade, and interest rate all contribute alongside income. Loan purpose matters for one category in particular: borrowers seeking home improvement request about $494 more than debt consolidation borrowers with otherwise identical characteristics, while the other intent categories do not differ significantly from that baseline. Age is not significant at the 5% level.

This has a practical implication: a lender that looks only at income may consistently offer the wrong loan size to borrowers with specific purposes. Incorporating loan intent into credit models may lead to more accurate and better-matched lending decisions.

Policy Implications

If loan intent independently predicts loan size beyond income and interest rate, then lenders that ignore it may systematically under- or over-lend to certain borrower segments. Policymakers and financial institutions could consider:

  1. Purpose-adjusted loan sizing — offering different loan products calibrated to the typical size needs of different intent categories.
  2. Employment-length thresholds — using employment stability as a secondary screening criterion for larger loan requests.
  3. Income-intent interaction — investigating whether the income-loan size relationship differs across intent categories, which could reveal segments where standard income-based models are insufficient.

Limitations and Reproducibility

Limitations

Limitation 1 — Missing credit score: The dataset does not include credit score information, which is one of the most important determinants of loan terms in practice. Its absence likely limits the predictive power of our models and may introduce omitted variable bias: borrowers with higher credit scores may systematically request larger loans, and this effect may currently be absorbed by correlated predictors such as income.

Limitation 2 — Cross-sectional data: The dataset covers a single point in time and cannot capture how borrowing behavior changes over the economic cycle. During recessions, borrowers may request smaller loans or lenders may tighten criteria, and these dynamics would not be reflected in this data.

Reproducibility

All random processes use set.seed(465) to ensure identical results on every render. All file paths are relative: credit_risk_dataset.csv must be in the same directory as the .qmd file. The analysis uses standard CRAN packages: tidyverse and tidymodels. The document can be rendered with the RStudio Render button or quarto::quarto_render().

AI Use Log

During this project, an AI tool was consulted alongside the course materials for certain parts of the writing and code. The Week 7, 8, and 9 labs provided the core structure for the regression and cross-validation workflow. The AI was used to clarify specific implementation questions where the lab materials did not cover the exact use case, and all AI-suggested approaches were verified against the course content before use.

Interaction 1 — Stage 1 Correction

Prompt given: “I have a binary outcome variable (default.payment.next.month) in my credit card dataset. My Stage 1 report analyzed the wrong variable. I used LIMIT_BAL instead of the actual target. My professor said the outcome should be Bernoulli, not log-normal. Can you explain what Bernoulli distribution is and how I should correctly analyze a binary outcome in R for a classification task?”

How the output was used: The AI explained that a binary variable follows a Bernoulli distribution and that logistic regression is the appropriate model. This was used to correct the Stage 2 classification approach. The explanation was verified against the Week 8 and Week 9 course labs before implementation.

Verification: The explanation was cross-checked against the Week 8 lab on classification and confirmed to be consistent with what was covered in class before applying it to the project.

Interaction 2 — CV Comparison Table

Prompt given: “I ran 5 fold cross validation using fit_resamples() and collected metrics with collect_metrics(). I also have test set metrics in a tibble. I want a single side-by-side table showing, for each metric, the CV mean and the test set value in the same row with a Difference column.”

How the output was used: The filter() and pull() approach suggested by the AI was adopted as it was cleaner than pivot_wider(). The code was adapted to our variable names and the output was manually verified before writing interpretations.

Verification: Each individual metric value was checked manually against the raw output of collect_metrics() and the metrics() tibble before including them in the comparison table.

Final Reflections

Suggested Improvement

The most valuable improvement would be to include credit score as a predictor. It is one of the most important variables in real-world lending decisions, and its absence is the biggest limitation of our current model. Adding it would likely improve both RMSE and R-squared.

New Economic Question

This analysis has inspired a follow-up question: Does the relationship between income and loan amount differ across loan intent categories? In other words, does income predict loan size differently for education loans compared to home improvement loans? Answering this with interaction terms could reveal whether lenders should apply different income-based criteria depending on the purpose of the loan, a finding with direct implications for product design and credit market efficiency.
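A minimal sketch of what such an interaction model could look like is given below. The data are simulated so that the income slope is steeper for home improvement loans than for education loans (the slopes, sample size, and noise level are all assumptions); on the real data the same formula could be passed to the existing lm_spec.

```r
set.seed(465)

# Simulated data (an assumption): steeper income slope for home improvement loans
n <- 2000
sim <- data.frame(
  person_income = runif(n, 20000, 100000),
  loan_intent   = factor(rep(c("EDUCATION", "HOMEIMPROVEMENT"), each = n / 2))
)
slope <- ifelse(sim$loan_intent == "HOMEIMPROVEMENT", 0.07, 0.05)
sim$loan_amnt <- 1000 + slope * sim$person_income + rnorm(n, sd = 1000)

# person_income * loan_intent expands to both main effects plus their interaction
fit_int <- lm(loan_amnt ~ person_income * loan_intent, data = sim)

# The interaction term estimates the difference in income slopes between intents
int_est <- coef(fit_int)[["person_income:loan_intentHOMEIMPROVEMENT"]]
cat("Estimated slope difference:", round(int_est, 4), "\n")
```

A significant interaction coefficient would indicate that an extra dollar of income translates into a different loan size depending on purpose, which is the pattern the follow-up question asks about.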

Conclusion

This project applied the complete data science pipeline to a real-world credit market dataset. The key findings are:

  1. Loan amount can be predicted with moderate accuracy (test-set R-squared of 0.347) using nine variables: income, interest rate, age, employment length, loan intent, loan grade, home ownership, loan percent income, and credit history length.
  2. A broader borrower profile outperforms the two-variable baseline of income and interest rate; loan grade and loan percent income in particular add substantial predictive power.
  3. Income and loan percent income are the strongest predictors. Among loan purposes, only home improvement is associated with significantly larger loans, so incorporating loan purpose may help lenders size offers for that segment.
  4. Performance is consistent across the five cross-validation folds (RMSE 3,954, R-squared 0.611), but test-set performance is weaker (RMSE 5,823, R-squared 0.347), so out-of-sample accuracy should be interpreted with caution. The analysis is fully reproducible with set.seed(465).