| Total Applicants | Average Age | Average Income | Average Credit Score | Approval Rate |
|---|---|---|---|---|
| 44,993 | 27.7 | $79,908 | 632.6 | 22.2% |
The Credit Frontier: Decoding Financial Risk and Ethical Lending
Introduction
Hello! My name is Tu Nguyen. I am a graduating senior at Xavier University majoring in Business Analytics and Information Systems (BAIS) with a minor in Statistics. As I close this chapter of my education to enter the corporate world, I have realized that financial literacy and data-driven decision making are more important than ever. Whether it is for a home, a car, or school tuition, loans are often the engine behind our biggest life milestones.
I am interested in how automated lending models balance mathematical risk with ethical fairness. In this project, I dive into the data to uncover the hidden patterns that drive these life-changing financial decisions.
The Data set
To explore these questions, I am analyzing a Loan Approval Classification data set. This is a synthetic data set inspired by real-world credit risk data and enriched with SMOTENC, a synthetic oversampling technique for mixed numeric and categorical data, to provide a robust environment for binary classification. The data set contains 45,000 observations, each representing a unique loan applicant described by 14 variables.
Data Dictionary
| Name | Description | Type |
|---|---|---|
| person_age | Age of the person | Float |
| person_gender | Gender of the person | Categorical |
| person_education | Highest education level | Categorical |
| person_income | Annual income | Float |
| person_emp_exp | Years of employment experience | Integer |
| person_home_ownership | Home ownership status (e.g., rent, own, mortgage) | Categorical |
| loan_amnt | Loan amount requested | Float |
| loan_intent | Purpose of the loan | Categorical |
| loan_int_rate | Loan interest rate | Float |
| loan_percent_income | Loan amount as a percentage of annual income | Float |
| cb_person_cred_hist_length | Length of credit history in years | Float |
| credit_score | Credit score of the person | Integer |
| previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical |
| loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer |
Data replicability
To replicate this study, the raw data can be accessed via my OneDrive here: https://myxavier-my.sharepoint.com/:x:/g/personal/nguyent45_xavier_edu/IQA0hA6bgTTOSLnSwlTT9DBRAS6RuOjlJLsuwGHJgZqk8Wc?e=7L9q27
Questions to answer:
What are the summary statistics of our applicants?
Does education level affect the approval rate?
Which loan purposes are the most and least likely to be approved by the automated system?
Does a high credit score guarantee approval or are there “hidden” factors causing high-score applicants to be rejected?
How does the annual income profile of approved borrowers differ from those who were rejected?
Is there a correlation between the total loan amount requested and the interest rate assigned by the algorithm?
The average applicant is relatively young (27 to 28 years old) with a solid income of about $79,900. The approval rate is on the lower end at 22.2%, which could be due to the low average credit score of 632.6.
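The headline figures above can be reproduced with a few lines of code. The following is a minimal sketch that uses only the Python standard library; the four sample records are hypothetical and stand in for the full file, while the column names follow the data dictionary.

```python
from statistics import mean

# Hypothetical sample records; column names follow the data dictionary.
applicants = [
    {"person_age": 20.0, "person_income": 50000.0, "credit_score": 600, "loan_status": 1},
    {"person_age": 30.0, "person_income": 60000.0, "credit_score": 650, "loan_status": 0},
    {"person_age": 40.0, "person_income": 70000.0, "credit_score": 700, "loan_status": 0},
    {"person_age": 30.0, "person_income": 100000.0, "credit_score": 650, "loan_status": 0},
]


def summarize(rows):
    """Return headline statistics for a list of applicant records."""
    return {
        "total_applicants": len(rows),
        "average_age": mean(r["person_age"] for r in rows),
        "average_income": mean(r["person_income"] for r in rows),
        "average_credit_score": mean(r["credit_score"] for r in rows),
        # loan_status is 1 for approved and 0 for rejected,
        # so its mean is the approval rate.
        "approval_rate": mean(r["loan_status"] for r in rows),
    }


summary = summarize(applicants)
print(summary)
```

Because `loan_status` is coded as 0 or 1, averaging it gives the approval rate directly, with no separate counting step.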
Descriptive Analysis
To understand the patterns within the lending data, I developed six visualizations that explore the relationship between different predictors and financial outcomes.
This visualization reveals a hierarchy in how education levels influence the “trust” an automated model places in an applicant. While Doctorate holders see the highest approval rates, an interesting anomaly appears in the bar chart: Master’s degree holders have lower approval rates than applicants with a Bachelor’s degree, a high school diploma (GED), or an Associate degree. This suggests either that the model overweights specialized certifications or that applicants with certain advanced degrees request loans that exceed the model’s debt-to-income thresholds.
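The approval rates behind a bar chart like this one come from a simple group-by. Below is a minimal sketch, assuming each record carries the `person_education` and `loan_status` columns from the data dictionary; the helper name `approval_rate_by` and the sample records are hypothetical.

```python
from collections import defaultdict


def approval_rate_by(rows, key):
    """Group records by `key` and return the share of approved loans per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for row in rows:
        counts[row[key]][0] += row["loan_status"]
        counts[row[key]][1] += 1
    return {group: approved / total for group, (approved, total) in counts.items()}


# Hypothetical sample records, not taken from the real file.
sample = [
    {"person_education": "Bachelor", "loan_status": 1},
    {"person_education": "Bachelor", "loan_status": 0},
    {"person_education": "Master", "loan_status": 0},
    {"person_education": "Master", "loan_status": 0},
    {"person_education": "Master", "loan_status": 1},
    {"person_education": "Master", "loan_status": 0},
]

rates = approval_rate_by(sample, "person_education")
print(rates)
```

The same helper works for any categorical column, for example `loan_intent`, by changing the `key` argument.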
This density plot reveals a risk-reward paradox in this lending data. Typically, we expect approved loans to cluster at lower interest rates, yet the data shows a significant peak of approvals near the 15% mark. This suggests that the automated system favors high-interest, high-yield lending, approving more expensive loans while rejecting applicants at lower rate thresholds. For a borrower, this highlights a critical lesson: a yes from an algorithm doesn’t always mean you’ve been offered the most favorable market terms. The pattern is consistent with predatory pricing in this data set.
This is the most interesting finding to me: there is no credit score gap between approved and rejected applicants. The medians for both groups are nearly identical, centered around 640. This suggests that within this specific FinTech model, the traditional FICO-style credit score is not the primary driver of the final decision. It points instead to a shift toward holistic modeling, where factors like employment experience and income stability likely outweigh a single numerical score. This is evident in the two outliers with scores of 780 to 800 who were still rejected for a loan.
I used a logarithmic scale to visualize the approval frontier. A dense cluster of rejections (red) appears in the upper-left quadrant, representing applicants asking for high loan amounts relative to lower annual incomes. This visual effectively maps what looks like the model’s hard-coded debt-to-income ceiling. Even with a strong profile, an applicant can be rejected if the requested loan amount exceeds that boundary, regardless of credit history or education.
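The boundary described above can be expressed as a simple ratio check. The sketch below is illustrative only: the 0.40 ceiling is an assumed value chosen for the example, not a threshold recovered from the data set, and both function names are hypothetical.

```python
# Assumed ceiling for illustration only; the model's real cutoff is unknown.
HYPOTHETICAL_CEILING = 0.40


def loan_to_income(loan_amnt, person_income):
    """Loan amount as a fraction of annual income (compare loan_percent_income)."""
    if person_income <= 0:
        raise ValueError("person_income must be positive")
    return loan_amnt / person_income


def exceeds_ceiling(loan_amnt, person_income, ceiling=HYPOTHETICAL_CEILING):
    """Return True when the requested loan is above the assumed ceiling."""
    return loan_to_income(loan_amnt, person_income) > ceiling


# A $35,000 request on a $50,000 income is a ratio of 0.70.
print(loan_to_income(35000, 50000), exceeds_ceiling(35000, 50000))
# A $10,000 request on an $80,000 income is a ratio of 0.125.
print(loan_to_income(10000, 80000), exceeds_ceiling(10000, 80000))
```

On log-log axes, a fixed ratio like this appears as a straight diagonal line, which is why a ratio-based cutoff would show up as a clean frontier in the scatter plot.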
The purpose of a loan significantly changes the risk weight assigned by the algorithm. Debt Consolidation and Education show a remarkably high share of approvals, most likely because these are seen as investments in future financial stability. In contrast, Venture and Medical intents show a higher proportion of rejections. These are seen as higher risk because most ventures fail and medical loans do not necessarily offer a way to calculate returns. This highlights a potential ethical concern: automated models may inadvertently penalize individuals facing medical emergencies. It also illustrates the need for human oversight in AI-driven finance decisions.
| | person_age | person_income | loan_amnt | credit_score | loan_int_rate |
|---|---|---|---|---|---|
| person_age | 1.00 | 0.14 | 0.05 | 0.17 | 0.01 |
| person_income | 0.14 | 1.00 | 0.31 | 0.03 | 0.00 |
| loan_amnt | 0.05 | 0.31 | 1.00 | 0.01 | 0.15 |
| credit_score | 0.17 | 0.03 | 0.01 | 1.00 | 0.01 |
| loan_int_rate | 0.01 | 0.00 | 0.15 | 0.01 | 1.00 |
The correlation matrix confirms why the lending model is so complex: most variables show a near-zero correlation with one another, so no single factor like income or credit score can predict an approval on its own. The strongest relationship is between income and loan amount (0.31), followed by age and credit score (0.17) and loan amount and interest rate (0.15). Credit score and interest rate are essentially uncorrelated (0.01), so in this data set a high score neither guarantees a yes nor reliably lowers the cost of borrowing. Because the correlations are so weak, it is hard to infer the overall logic behind the lending decisions from pairwise relationships alone.
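For readers who want to check an individual cell of the matrix, here is a minimal sketch of the calculation. It assumes the matrix reports Pearson correlation coefficients, which the report does not state explicitly; the function name `pearson` and the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values standing in for person_income and loan_amnt.
incomes = [40000, 55000, 70000, 90000, 120000]
loan_amounts = [8000, 6000, 15000, 12000, 20000]
print(round(pearson(incomes, loan_amounts), 2))
```

A coefficient near 0 only rules out a linear relationship between two variables. A model can still depend heavily on a variable through thresholds or interactions, which is one reason a weak matrix does not mean the variables are unimportant.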
Secondary Data Source
The secondary data set comes from the Federal Reserve Bank of St. Louis. It serves as a benchmark for the interest rates in the synthetic data set.
While the model only approves 22.2% of applicants, those who do receive a loan are charged an average interest rate of 11.01%. When benchmarked against the May 2026 market average of 6.37%, it becomes clear that this automated system is significantly more expensive than traditional lending options.
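The size of that gap can be made concrete with a short calculation. The sketch below uses only the two rates quoted above; the function name `rate_gap` is hypothetical.

```python
MODEL_AVG_RATE = 11.01  # average rate on approved loans, as quoted in the text
BENCHMARK_RATE = 6.37   # market average, as quoted in the text


def rate_gap(model_rate, benchmark_rate):
    """Return the spread in percentage points and the ratio of the two rates."""
    spread = model_rate - benchmark_rate
    ratio = model_rate / benchmark_rate
    return round(spread, 2), round(ratio, 2)


spread, ratio = rate_gap(MODEL_AVG_RATE, BENCHMARK_RATE)
print(f"Spread: {spread} percentage points; ratio: {ratio}x")
```

In other words, approved borrowers in the synthetic data set pay about 4.64 percentage points more than the benchmark, or roughly 1.73 times the benchmark rate.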
Conclusion
Overall, it was a great experience to analyze this loan data set and see how modern lending algorithms might be making decisions behind the scenes. The primary data set provided a look at how factors like income, age, and loan purpose interact to determine financial access.
With these two data sets together, it is clear that there are many patterns to be discovered when it comes to the logic of automated lending:
The selective nature of the model.
The Credit Score Paradox.
Loan intent matters.
The real-world rate comparison.