Credit Risk & Loan Approval

Do lenders rely on credit scores alone — or does the full applicant profile matter?

Ece Kurtoğlu & Halil Rıfat Başbuğ

Part 1

Economic Question · Dataset · Probability Analysis · Modeling

Economic Question & Dataset

Economic Question

Do lenders rely primarily on creditworthiness signals (CIBIL score, income) or does a broader applicant profile — assets, education, employment — independently predict loan approval?

The key question is not just whether we can predict approval, but what information lenders actually use. This distinction matters for financial inclusion policy: applicants with limited credit histories may face structural barriers regardless of their actual assets or earning potential.

Dataset

Source	Kaggle — Loan Approval Prediction
Observations	4,269 applications
Target	`loan_status` (Approved / Rejected)
Predictors	CIBIL score, income, loan amount & term, assets, education, employment, dependents

Cleaned with `clean_names()`, `drop_na()`, and `str_trim()`. No missing values remain.

Probability Analysis Highlights

The classes are moderately imbalanced at 62.2% approved vs 37.8% rejected. A naive majority-class classifier that always predicts approval would reach only 62.2% accuracy, so a model that scores well above that baseline is genuinely learning from the predictors. The CIBIL histogram shows near-complete separation: rejected applications cluster at low scores, approved at high scores.
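The baseline implied by the class shares can be checked with a few lines of arithmetic. This is a minimal illustrative sketch in Python (the analysis itself appears to have been done in R); the variable names are ours and only the two percentages come from the slide.

```python
# Class shares reported on the slide (percent of 4,269 applications).
class_share = {"Approved": 62.2, "Rejected": 37.8}

# A majority-class classifier always predicts the most common label,
# so its accuracy equals that label's share of the data.
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_label]

print(majority_label, baseline_accuracy)
```

Any reported accuracy is best read against this baseline rather than against 50%.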

Modeling Approach

We use logistic regression because the target is binary. It models the log-odds of approval as a linear function of predictors and produces interpretable coefficients.
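To make "linear in the log-odds" concrete, the sketch below converts a linear score into an approval probability. It is illustrative Python, not the fitted model: the intercept and the single coefficient are hypothetical numbers chosen only to show the shape of the relationship.

```python
import math

def approval_probability(intercept, coefs, features):
    """Logistic model: the log-odds are a linear function of the features."""
    log_odds = intercept + sum(coefs[name] * features[name] for name in coefs)
    # The logistic (sigmoid) function maps log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values for illustration only, NOT the estimated coefficients.
intercept = -11.0
coefs = {"cibil_score": 0.02}

low = approval_probability(intercept, coefs, {"cibil_score": 400})
high = approval_probability(intercept, coefs, {"cibil_score": 700})
print(round(low, 3), round(high, 3))
```

With these made-up numbers, a low score maps to a probability well below 0.5 and a high score to one well above it, which is the pattern the CIBIL histogram suggests.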

Model 1 — Core Financial Indicators

4 predictors

Variable	Economic role
`cibil_score`	Creditworthiness
`income_annum`	Repayment capacity
`loan_amount`	Lender exposure
`loan_term`	Repayment horizon

Tests whether standard financial metrics alone are sufficient.

Model 2 — Full Applicant Profile

11 predictors (the 4 above plus the 7 below)

Variable	Economic role
Assets (4 types)	Collateral
`education`	Long-term income stability
`self_employed`	Income uncertainty
`no_of_dependents`	Financial obligations

Tests whether a richer profile improves predictions.

Setup: 80/20 stratified split · 5-fold cross-validation · `set.seed(465)`
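A stratified split keeps the approved/rejected mix roughly the same in the training and test sets. The sketch below shows the idea in plain Python on toy labels. It is an assumption-laden illustration, not the authors' R code: `stratified_split` is a hypothetical helper, and seeding Python's generator with 465 does not reproduce R's `set.seed(465)` draws.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=465):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with roughly the slide's 62/38 class mix (not the real data).
labels = ["Approved"] * 62 + ["Rejected"] * 38
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Cross-validation folds can be built the same way, by dealing each class's shuffled indices into five groups.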

Model Comparison & Why We Chose Model 1

Model	Accuracy	Precision	Recall
Model 1: Core Financial (4 vars) ✓ SELECTED	92.6%	95.4%	92.7%
Model 2: Full Profile (11 vars)	92.4%	95.3%	92.3%

Model 1 wins on three grounds:

Performance: Slightly higher accuracy, precision, and recall on the held-out test set (by 0.1 to 0.4 percentage points) despite using only 4 variables instead of 11. If the extra variables carried genuine signal, Model 2 would be expected to win; it does not.
Parsimony: Adding complexity without performance gain violates Occam’s Razor. The extra variables add noise, not signal.
Interpretability: A lender can explain every decision directly in terms of creditworthiness, repayment capacity, loan size, and term — important for regulatory compliance.
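For reference, the three metrics in the comparison table are all derived from a confusion matrix. The sketch below uses hypothetical counts (not the study's test set) and assumes "Approved" is the positive class, which the slides do not state explicitly.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
    precision = tp / (tp + fp)                   # share of predicted approvals that were approved
    recall = tp / (tp + fn)                      # share of actual approvals the model found
    return accuracy, precision, recall

# Hypothetical counts for illustration only.
accuracy, precision, recall = classification_metrics(tp=50, fp=2, fn=4, tn=44)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Precision above recall, as in both models, would mean the model wrongly approves fewer applications than it wrongly rejects, under the positive-class assumption above.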

Part 2

Coefficient Interpretation · Cross-Validation · Economic Implications · Conclusion

Coefficient Interpretation

Variable	Direction	Odds Ratio	Significance
CIBIL Score	↑ Higher score → approved	1.0245	*** (|z| = 26.7)
Loan Term	↓ Longer term → rejected	0.8624	*** (|z| = 11.8)
Loan Amount	↓ Larger loan → rejected	1.0000	*** (|z| = 7.0)
Annual Income	↑ Higher income → approved	1.0000	*** (|z| = 6.8)

All four predictors are statistically significant at p < 0.001. Each additional CIBIL point multiplies the odds of approval by 1.0245, and with |z| = 26.7 it far dominates all other variables. Each additional unit of loan term multiplies the odds by 0.8624, so longer terms reduce approval odds. Income increases approval odds (repayment capacity) and larger loans reduce them (greater lender risk); both odds ratios round to 1.0000 because they are measured per single unit of currency, so the effect only becomes visible over realistic differences in income or loan size.

Note: tidymodels/glm predicts P(rejected) by default; signs are flipped here to reflect P(approved), and z-statistics are reported as magnitudes.
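A per-point odds ratio of 1.0245 looks small, but odds ratios compound multiplicatively across units. The sketch below works through that arithmetic in Python; `odds_multiplier` is an illustrative helper of ours, and the only input taken from the slides is the CIBIL odds ratio.

```python
import math

CIBIL_ODDS_RATIO = 1.0245  # per-point odds ratio from the coefficient table

def odds_multiplier(per_unit_odds_ratio, units):
    """A change of k units multiplies the odds by OR ** k."""
    return per_unit_odds_ratio ** units

# The underlying logistic coefficient is the natural log of the odds ratio.
cibil_coefficient = math.log(CIBIL_ODDS_RATIO)

for points in (1, 50, 100):
    print(points, round(odds_multiplier(CIBIL_ODDS_RATIO, points), 2))
```

A 100-point difference in CIBIL score therefore corresponds to roughly an elevenfold change in approval odds, holding the other predictors fixed.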

Cross-Validation & Model Stability

Metric	5-Fold CV Mean	Test Set	Difference (test minus CV, pp)
Accuracy	91.5%	92.6%	+1.10
Precision	93.0%	95.4%	+2.38
Recall	93.5%	92.7%	-0.79

The test-set metrics are within about 1 to 2.4 percentage points of the 5-fold CV means, which indicates no meaningful overfitting. Performance is consistent across all five folds, meaning the results do not depend on any particular subset of the training data.

✅ No overfitting — test performance matches CV.

✅ Stable — consistent across all 5 folds.

`set.seed(465)` ensures all results are fully reproducible.
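The gaps between cross-validation and test performance can be recomputed from the rounded figures in the table. The sketch below is illustrative Python with names of our choosing. Because it starts from rounded percentages, its results differ slightly from the table's last column, which presumably was computed from unrounded values.

```python
# Rounded metrics from the cross-validation table, in percent.
cv_mean = {"accuracy": 91.5, "precision": 93.0, "recall": 93.5}
test_set = {"accuracy": 92.6, "precision": 95.4, "recall": 92.7}

# Gap in percentage points: positive means the test set scored higher than CV.
gap_pp = {metric: round(test_set[metric] - cv_mean[metric], 1) for metric in cv_mean}
largest_gap = max(abs(gap) for gap in gap_pp.values())

print(gap_pp, largest_gap)
```

Test scores that sit at or above the CV means, as accuracy and precision do here, are the opposite of the pattern overfitting would produce.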

Economic Implications & Limitations

Answer to the Economic Question

Lenders in this dataset use a narrow, credit-score-dominated model. CIBIL score, income, loan amount, and loan term are sufficient to predict approval with 92.6% accuracy — broader variables add nothing.

Policy implications:

1. Alternative credit scoring — incorporate utility payments, rental history, and mobile transactions for applicants without formal credit records.

2. Credit-building programs — small guaranteed loans to help applicants establish CIBIL histories.

3. Collateral-based lending — give weight to asset holdings for applicants with assets but no credit history.

Limitations

External validity — Kaggle dataset from one lender. The dominance of CIBIL score may not generalise to other markets or regulatory environments.

Observational data — We identify correlations, not causal relationships. CIBIL score aggregates many aspects of credit history; its coefficient captures a composite effect.

Omitted variables — Loan purpose, applicant–lender relationship, and regional conditions are unobserved and could confound estimates.

Future question: “What is the causal effect of a one-unit CIBIL increase on approval probability, and does it differ across income groups?”

Conclusion

#	Finding
01	Loan approval predicted with 92.6% accuracy using only 4 variables: CIBIL score, income, loan amount, and loan term.
02	A broader applicant profile (assets, education, employment, dependents) adds no predictive value over the simpler model.
03	Lenders operate a narrow, credit-score-dominated model — past repayment behaviour is the primary screening criterion.
04	Results are stable across 5-fold cross-validation with no overfitting. Fully reproducible with `set.seed(465)`.

Structural implication: Credit access is primarily gated by credit score. Applicants who lack credit histories face structural barriers regardless of their actual economic circumstances — a finding with direct implications for financial inclusion policy.

Thank you — we welcome your questions.