Do lenders rely on credit scores alone — or does the full applicant profile matter?
Economic Question · Dataset · Probability Analysis · Modeling
Do lenders rely primarily on creditworthiness signals (CIBIL score, income) or does a broader applicant profile — assets, education, employment — independently predict loan approval?
The key question is not just whether we can predict approval, but what information lenders actually use. This distinction matters for financial inclusion policy: applicants with limited credit histories may face structural barriers regardless of their actual assets or earning potential.
| Field | Description |
|---|---|
| Source | Kaggle, Loan Approval Prediction |
| Observations | 4,269 applications |
| Target | loan_status (Approved / Rejected) |
| Predictors | CIBIL score, income, loan amount & term, assets, education, employment, dependents |
Cleaned with clean_names(), drop_na(), and str_trim(). No missing values remain.
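As an illustration of what these three cleaning steps do, here is a minimal Python sketch using only the standard library. It is a rough analogue, not the R pipeline used in the analysis: `clean_name` and `clean_rows` are hypothetical helpers, and the sample rows are synthetic.

```python
import csv
import io
import re

def clean_name(name):
    """Lower-case a header and turn runs of non-alphanumerics into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")

def clean_rows(raw_csv_text):
    """Standardise headers, trim whitespace, and drop rows with missing values."""
    reader = csv.reader(io.StringIO(raw_csv_text))
    header = [clean_name(h) for h in next(reader)]
    rows = []
    for record in reader:
        values = [v.strip() for v in record]          # trim stray spaces
        if len(values) == len(header) and all(values):  # skip incomplete rows
            rows.append(dict(zip(header, values)))
    return rows

# Synthetic example rows (not taken from the real dataset).
raw = "Loan ID, CIBIL Score, Loan Status\n1, 778, Approved\n2, , Rejected\n3, 506, Rejected \n"
cleaned = clean_rows(raw)
print(cleaned)
```

The second synthetic row is dropped because its score is blank after trimming; the other two rows keep their values with surrounding spaces removed.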
The dataset is reasonably balanced at 62.2% approved vs 37.8% rejected, so a naive majority-class classifier would reach only 62.2% accuracy; a model that scores well above that baseline is genuinely learning. The CIBIL histogram shows near-complete separation: rejected applications cluster at low scores, approved at high scores.
We use logistic regression because the target is binary. It models the log-odds of approval as a linear function of predictors and produces interpretable coefficients.
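The link between log-odds and probability can be sketched in a few lines of Python. The intercept and slope below are hypothetical values chosen only to make the curve easy to read; they are not the fitted coefficients from this analysis.

```python
import math

def log_odds(p):
    """Log-odds (logit) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def probability(linear_predictor):
    """Inverse logit: map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical intercept and slope for illustration only;
# NOT the coefficients estimated in this project.
intercept, slope = -10.0, 0.02

def approval_probability(cibil_score):
    """Approval probability under the hypothetical one-predictor model."""
    return probability(intercept + slope * cibil_score)

for score in (400, 500, 600):
    print(score, round(approval_probability(score), 3))
```

Because the model is linear on the log-odds scale, each coefficient shifts the log-odds by a fixed amount per unit, which is what makes the coefficients interpretable as odds ratios.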
4 predictors
| Variable | Economic role |
|---|---|
| cibil_score | Creditworthiness |
| income_annum | Repayment capacity |
| loan_amount | Lender exposure |
| loan_term | Repayment horizon |
Tests whether standard financial metrics alone are sufficient.
11 predictors (all above +)
| Variable | Economic role |
|---|---|
| Assets (4 types) | Collateral |
| education | Long-term income stability |
| self_employed | Income uncertainty |
| no_of_dependents | Financial obligations |
Tests whether a richer profile improves predictions.
Setup: 80/20 stratified split · 5-fold cross-validation · set.seed(465)
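The resampling setup can be sketched in plain Python as follows. This is an illustration of the idea, not the R code used in the project: `stratified_split` and `k_folds` are hypothetical helpers, the labels are synthetic, the folds here are not stratified, and a Python seed of 465 does not reproduce R's random stream.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=465):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_folds(indices, k=5, seed=465):
    """Yield (analysis, assessment) index lists for plain k-fold CV."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        analysis = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield analysis, folds[i]

# Synthetic labels with roughly the class shares reported for the dataset.
labels = ["Approved"] * 622 + ["Rejected"] * 378
train_idx, test_idx = stratified_split(labels)
folds = list(k_folds(train_idx))
print(len(train_idx), len(test_idx), len(folds))
```

Stratifying the split keeps the approved/rejected shares the same in the training and test sets, so the test metrics are not distorted by a lucky or unlucky class mix.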
| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Model 1: Core Financial (4 vars) ✓ SELECTED | 92.6% | 95.4% | 92.7% |
| Model 2: Full Profile (11 vars) | 92.4% | 95.3% | 92.3% |
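For reference, the three metrics in the table can be computed from a confusion matrix as in the Python sketch below. It assumes that Approved is treated as the positive class, which the slides do not state, and the counts are hypothetical and not taken from this analysis.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall, assuming 'Approved' is the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted approvals, share truly approved
    recall = tp / (tp + fn)     # of true approvals, share predicted approved
    return accuracy, precision, recall

# Hypothetical counts for illustration only; not the project's confusion matrix.
accuracy, precision, recall = classification_metrics(tp=90, fp=5, fn=10, tn=95)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```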
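Model 1 is selected: it matches or slightly exceeds Model 2 on accuracy, precision, and recall while using 4 predictors instead of 11.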
Coefficient Interpretation · Cross-Validation · Economic Implications · Conclusion
| Variable | Direction (effect on approval) | Odds Ratio | Significance |
|---|---|---|---|
| CIBIL Score | ↑ Higher score → approved | 1.0245 | *** (z = 26.7) |
| Loan Term | ↓ Longer term → rejected | 0.8624 | *** (z = -11.8) |
| Loan Amount | ↓ Larger loan → rejected | 1.0000 | *** (z = -7) |
| Annual Income | ↑ Higher income → approved | 1.0000 | *** (z = 6.8) |
All four predictors are statistically significant at p < 0.001. Each additional CIBIL point multiplies the odds of approval by 1.0245; with a z-statistic of 26.7, it far dominates all other variables. Income increases approval odds (repayment capacity) and larger loans reduce them (greater lender risk); both odds ratios round to 1.0000 because they are expressed per single unit of income or loan amount. Each additional unit of loan term multiplies the odds of approval by 0.8624, so longer terms reduce them.
Note: glm models the probability of the second factor level, which here is Rejected; signs are flipped in this table to reflect P(approved).
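A per-point odds ratio of 1.0245 looks small, but it compounds multiplicatively. The Python sketch below shows the arithmetic using only the CIBIL odds ratio reported in the table; `odds_multiplier` is a hypothetical helper name.

```python
import math

odds_ratio_per_point = 1.0245  # CIBIL odds ratio reported in the table above

def odds_multiplier(odds_ratio, delta):
    """Multiplicative change in odds for a delta-unit increase in the predictor."""
    return odds_ratio ** delta

# Implied log-odds coefficient per CIBIL point.
coefficient = math.log(odds_ratio_per_point)

for delta in (1, 10, 100):
    print(delta, round(odds_multiplier(odds_ratio_per_point, delta), 2))
```

Under this model, a 100-point increase in CIBIL score multiplies the odds of approval by roughly 11, holding the other predictors fixed.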
| Metric | 5-Fold CV Mean | Test Set | Difference (test minus CV, percentage points) |
|---|---|---|---|
| Accuracy | 91.5% | 92.6% | +1.10 |
| Precision | 93.0% | 95.4% | +2.38 |
| Recall | 93.5% | 92.7% | -0.79 |
The 5-fold CV means are within about 1 to 2.4 percentage points of the test-set values on every metric, and test accuracy and precision are not lower than their CV means, so there is no sign of overfitting. Performance is also consistent across all five folds, meaning the results do not depend on any particular subset of the training data.
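The gap check can be reproduced from the rounded figures in the table with a short Python sketch. Because it starts from rounded percentages, the gaps come out to one decimal place and not the two-decimal differences shown in the table.

```python
# Rounded percentages copied from the CV vs test-set table.
cv_mean = {"accuracy": 91.5, "precision": 93.0, "recall": 93.5}
test_set = {"accuracy": 92.6, "precision": 95.4, "recall": 92.7}

# Gap in percentage points: test-set value minus CV mean.
gaps = {metric: round(test_set[metric] - cv_mean[metric], 1) for metric in cv_mean}
largest_gap = max(abs(gap) for gap in gaps.values())
print(gaps, largest_gap)
```

A large positive CV-minus-test gap would be the usual warning sign of overfitting; here the test set scores slightly higher on two of the three metrics.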
✅ No overfitting — test performance matches CV.
✅ Stable — consistent across all 5 folds.
set.seed(465) ensures all results are fully reproducible.
Lenders in this dataset use a narrow, credit-score-dominated model. CIBIL score, income, loan amount, and loan term are sufficient to predict approval with 92.6% accuracy; the seven additional profile variables in Model 2 do not improve any metric (92.4% accuracy).
Policy implications:
1. Alternative credit scoring — incorporate utility payments, rental history, and mobile transactions for applicants without formal credit records.
2. Credit-building programs — small guaranteed loans to help applicants establish CIBIL histories.
3. Collateral-based lending — give weight to asset holdings for applicants with assets but no credit history.
External validity — Kaggle dataset from one lender. The dominance of CIBIL score may not generalise to other markets or regulatory environments.
Observational data — We identify correlations, not causal relationships. CIBIL score aggregates many aspects of credit history; its coefficient captures a composite effect.
Omitted variables — Loan purpose, applicant–lender relationship, and regional conditions are unobserved and could confound estimates.
Future question: “What is the causal effect of a one-unit CIBIL increase on approval probability, and does it differ across income groups?”
| # | Finding |
|---|---|
| 01 | Loan approval predicted with 92.6% accuracy using only 4 variables: CIBIL score, income, loan amount, and loan term. |
| 02 | A broader applicant profile (assets, education, employment, dependents) adds no predictive value over the simpler model. |
| 03 | Lenders operate a narrow, credit-score-dominated model — past repayment behaviour is the primary screening criterion. |
| 04 | Results are stable across 5-fold cross-validation with no overfitting. Fully reproducible with set.seed(465). |
Structural implication: Credit access is primarily gated by credit score. Applicants who lack credit histories face structural barriers regardless of their actual economic circumstances — a finding with direct implications for financial inclusion policy.
Thank you — we welcome your questions.
ECON 465 · Stage 3 Final Presentation