Credit Risk & Loan Approval

Do lenders rely on credit scores alone — or does the full applicant profile matter?

Ece Kurtoğlu & Halil Rıfat Başbuğ

Economic Question · Dataset · Probability Analysis · Modeling

Economic Question & Dataset

Economic Question

Do lenders rely primarily on creditworthiness signals (CIBIL score, income) or does a broader applicant profile — assets, education, employment — independently predict loan approval?

The key question is not only whether we can predict approval, but which information lenders actually use. This distinction matters for financial inclusion policy: applicants with limited credit histories may face structural barriers regardless of their actual assets or earning potential.

Dataset

Source: Kaggle, Loan Approval Prediction
Observations: 4,269 applications
Target: loan_status (Approved / Rejected)
Predictors: CIBIL score, income, loan amount and term, assets, education, employment, dependents

Cleaned with clean_names(), drop_na(), and str_trim(). No missing values remain.
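The three cleaning steps can be illustrated outside R. The sketch below is a plain-Python analogue, not the pipeline used for this analysis. It assumes the raw file has padded column names and values, and the sample rows and the helper names `clean_name` and `clean_rows` are hypothetical.

```python
import csv
import io
import re

# Hypothetical raw extract: padded headers and values, one row with a missing field.
RAW = """loan_id, Cibil Score, loan_status
1, 778, Approved
2, , Rejected
3, 417, Rejected
"""

def clean_name(name):
    """Lower-case snake_case column name, similar in spirit to clean_names()."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")

def clean_rows(text):
    """Trim whitespace, normalise headers, and drop rows with any empty field."""
    reader = csv.reader(io.StringIO(text))
    header = [clean_name(h) for h in next(reader)]
    rows = []
    for record in reader:
        values = [v.strip() for v in record]            # str_trim() analogue
        if len(values) == len(header) and all(values):  # drop_na() analogue
            rows.append(dict(zip(header, values)))
    return rows

cleaned = clean_rows(RAW)
print(cleaned)
```

Only the two complete rows survive, with trimmed values and normalised column names.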

Probability Analysis Highlights

The dataset is reasonably balanced at 62.2% approved vs 37.8% rejected. A naive classifier that always predicts "Approved" would reach only 62.2% accuracy, so a model that clearly beats this baseline is genuinely learning from the predictors. The CIBIL histogram shows near-complete separation: rejected applications cluster at low scores, approved at high scores.
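The majority-class baseline follows directly from the class shares. A minimal Python sketch, using only the figures quoted above (the class counts are approximations implied by the rounded percentages, not values taken from the data):

```python
# Figures quoted in the text.
n_total = 4269
share_approved = 0.622

# A majority-class rule always predicts the larger class, so its accuracy
# equals the share of that class.
baseline_accuracy = max(share_approved, 1 - share_approved)

# Approximate class counts implied by the rounded percentages.
n_approved = round(n_total * share_approved)
n_rejected = n_total - n_approved

print(f"baseline accuracy: {baseline_accuracy:.1%}")
print(f"approx. approved: {n_approved}, rejected: {n_rejected}")
```

Any model worth reporting should be judged against this baseline rather than against 50%.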

Modeling Approach

We use logistic regression because the target is binary. It models the log-odds of approval as a linear function of predictors and produces interpretable coefficients.
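To make "log-odds as a linear function" concrete, here is a small Python sketch. The coefficients, intercept, and applicant values are hypothetical and chosen only for illustration; they are not the fitted estimates from this project.

```python
import math

def approval_probability(intercept, coefs, x):
    """Logistic model: the log-odds are linear in the predictors."""
    log_odds = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients and applicant, for illustration only.
coefs = {"cibil_score": 0.024, "loan_term": 0.15}
intercept = -14.0
applicant = {"cibil_score": 600, "loan_term": 10}

p = approval_probability(intercept, coefs, applicant)
print(f"P(approved) = {p:.3f}")
```

Exponentiating a coefficient gives the odds ratio for a one-unit increase in that predictor, which is what makes the model easy to interpret.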

Model 1 — Core Financial Indicators

4 predictors

Variable Economic role
cibil_score Creditworthiness
income_annum Repayment capacity
loan_amount Lender exposure
loan_term Repayment horizon

Tests whether standard financial metrics alone are sufficient.

Model 2 — Full Applicant Profile

11 predictors (the 4 above plus 7 more)

Variable Economic role
Assets (4 types) Collateral
education Long-term income stability
self_employed Income uncertainty
no_of_dependents Financial obligations

Tests whether a richer profile improves predictions.

Setup: 80/20 stratified split · 5-fold cross-validation · set.seed(465)
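The idea behind the setup can be sketched in plain Python. This is an illustration of a stratified split and k-fold partition, not the R code used for the analysis: the labels are hypothetical, the folds are left unstratified for brevity, and a Python seed of 465 will not reproduce the R results.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_share=0.2, seed=465):
    """Return (train_idx, test_idx) with class shares preserved in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_share)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_folds(indices, k=5, seed=465):
    """Split indices into k near-equal folds (unstratified, for brevity)."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Hypothetical labels with roughly the class balance reported earlier.
labels = ["Approved"] * 622 + ["Rejected"] * 378
train_idx, test_idx = stratified_split(labels)
folds = k_folds(train_idx)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

Stratifying the split keeps the approved/rejected ratio the same in the training and test sets, so test metrics are not distorted by an unlucky draw.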

Model Comparison & Why We Chose Model 1

Model | Accuracy | Precision | Recall
Model 1: Core Financial (4 vars) ✓ SELECTED | 92.6% | 95.4% | 92.7%
Model 2: Full Profile (11 vars) | 92.4% | 95.3% | 92.3%
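For reference, the three metrics in the table are computed from a confusion matrix as sketched below in Python. The counts are hypothetical, since the slides do not report the actual confusion matrices, and treating "Approved" as the positive class is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall with 'Approved' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted approvals that were correct
    recall = tp / (tp + fn)      # share of actual approvals that were found
    return accuracy, precision, recall

# Hypothetical confusion-matrix counts, for illustration only.
accuracy, precision, recall = classification_metrics(tp=90, fp=5, fn=10, tn=95)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Precision and recall are reported alongside accuracy because the classes are not perfectly balanced.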

Model 1 wins on three grounds:

  • Performance: Accuracy, precision, and recall on the held-out test set are all slightly higher than for Model 2, despite using only 4 variables instead of 11. If the extra variables carried genuine signal, Model 2 should come out ahead, and it does not.
  • Parsimony: Adding complexity without a performance gain violates Occam’s Razor. In this dataset the extra variables contribute no measurable predictive signal.
  • Interpretability: A lender can explain every decision directly in terms of creditworthiness, repayment capacity, loan size, and term — important for regulatory compliance.

Coefficient Interpretation · Cross-Validation · Economic Implications · Conclusion

Coefficient Interpretation

Variable | Direction | Odds Ratio | Significance
CIBIL Score | ↑ Higher score → approved | 1.0245 | *** (z = 26.7)
Loan Term | ↑ Longer term → approved | 1.1596 | *** (z = 11.8)
Loan Amount | ↓ Larger loan → rejected | 1.0000 | *** (z = -7)
Annual Income | ↑ Higher income → approved | 1.0000 | *** (z = 6.8)

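All four predictors are statistically significant at p < 0.001. Each additional CIBIL point multiplies the odds of approval by 1.0245 (about a 2.45% increase per point), and its z-statistic of 26.7 is by far the largest in magnitude of the four. Income increases approval odds (repayment capacity), larger loans reduce them (greater lender risk), and longer terms increase them (smaller per-period burden). The odds ratios for loan amount and income display as 1.0000 only because they are measured per single currency unit; over realistic differences in amount their effects are not negligible.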

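Note: with the default alphabetical factor levels (Approved, Rejected), glm() models the probability of the second level, P(rejected). Signs are flipped here so that every effect refers to P(approved).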
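The sign flip in the note corresponds to taking reciprocals on the odds-ratio scale. A short Python sketch using the CIBIL odds ratio of 1.0245 quoted above (the 100-point comparison is an added illustration, not a figure from the analysis):

```python
import math

or_approved_per_point = 1.0245   # CIBIL odds ratio for approval, from the text

# Flipping the modelled outcome negates the coefficient, which inverts the odds ratio.
or_rejected_per_point = 1 / or_approved_per_point

# The matching logit coefficient is the log of the odds ratio.
coef_approved = math.log(or_approved_per_point)

# Odds ratios multiply, so a k-point change scales the odds by OR ** k.
or_approved_per_100 = or_approved_per_point ** 100

print(round(or_rejected_per_point, 4))
print(round(coef_approved, 4))
print(round(or_approved_per_100, 1))
```

A per-point ratio that looks tiny therefore compounds to a roughly elevenfold change in approval odds over a 100-point difference in score.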

Cross-Validation & Model Stability

Metric | 5-Fold CV Mean | Test Set | Difference (test minus CV, percentage points)
Accuracy | 91.5% | 92.6% | 1.10
Precision | 93.0% | 95.4% | 2.38
Recall | 93.5% | 92.7% | -0.79

The 5-fold CV means are within about 1 to 2.4 percentage points of the test-set results on every metric, and the test set is not systematically worse than CV, so there is no sign of overfitting. Performance is consistent across all five folds, meaning the results do not depend on any particular subset of the training data.
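The gaps can be recomputed from the rounded figures in the table, as in this Python sketch. Because it starts from rounded percentages, it gives 2.4 and -0.8 where the table, presumably computed from unrounded values, shows 2.38 and -0.79.

```python
# Rounded figures from the table, in percent.
cv_mean = {"accuracy": 91.5, "precision": 93.0, "recall": 93.5}
test_set = {"accuracy": 92.6, "precision": 95.4, "recall": 92.7}

# Gap in percentage points: positive means the test set scored higher than CV.
gaps = {m: round(test_set[m] - cv_mean[m], 1) for m in cv_mean}
largest = max(gaps, key=lambda m: abs(gaps[m]))

print(gaps)
print("largest gap:", largest)
```

Overfitting would typically show up as test performance well below the cross-validation estimate; here two of the three gaps favour the test set.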

No overfitting — test performance matches CV.

Stable — consistent across all 5 folds.

set.seed(465) ensures all results are fully reproducible.

Economic Implications & Limitations

Answer to the Economic Question

Lenders in this dataset use a narrow, credit-score-dominated model. CIBIL score, income, loan amount, and loan term are sufficient to predict approval with 92.6% accuracy — broader variables add nothing.

Policy implications:

1. Alternative credit scoring — incorporate utility payments, rental history, and mobile transactions for applicants without formal credit records.

2. Credit-building programs — small guaranteed loans to help applicants establish CIBIL histories.

3. Collateral-based lending — give weight to asset holdings for applicants with assets but no credit history.

Limitations

External validity — Kaggle dataset from one lender. The dominance of CIBIL score may not generalise to other markets or regulatory environments.

Observational data — We identify correlations, not causal relationships. CIBIL score aggregates many aspects of credit history; its coefficient captures a composite effect.

Omitted variables — Loan purpose, applicant–lender relationship, and regional conditions are unobserved and could confound estimates.


Future question: “What is the causal effect of a one-unit CIBIL increase on approval probability, and does it differ across income groups?”

Conclusion

Finding
01 Loan approval predicted with 92.6% accuracy using only 4 variables: CIBIL score, income, loan amount, and loan term.
02 A broader applicant profile (assets, education, employment, dependents) adds no predictive value over the simpler model.
03 Lenders operate a narrow, credit-score-dominated model — past repayment behaviour is the primary screening criterion.
04 Results are stable across 5-fold cross-validation with no overfitting. Fully reproducible with set.seed(465).

Structural implication: Credit access is primarily gated by credit score. Applicants who lack credit histories face structural barriers regardless of their actual economic circumstances — a finding with direct implications for financial inclusion policy.

Thank you — we welcome your questions.