Credit Portfolio Risk Analysis Using Anonymised Loan Application Data

MBA Data Analytics 2 Case Study

Author

Olayinka Oloko

Published

May 5, 2026

1 1. Executive Summary

This case study was prepared for my MBA Data Analytics 2 course using anonymised loan application and repayment data from Renmoney Microfinance Bank. The analysis focuses on a practical credit question: what does the portfolio data reveal about repayment behaviour, default risk, and the customer and loan characteristics that can support better lending decisions?

The dataset was shared by Denis Burakov at my request (see Appendix A). Denis is Renmoney’s Data Science Team Lead, and the dataset contains a random sample of 1,000 digital loan applications covering 2024-09-04 to 2026-03-22. The main outcome variable is is_bad, where 1 represents a defaulted loan and 0 represents a repaid loan.

I applied five techniques: exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and logistic regression.

The data review identified missing values, an unusable marital status column, invalid date placeholders, and skewed loan amounts. These were handled through data cleaning, complete-case filtering, date treatment, and log transformation.

Overall, the analysis supports a practical recommendation: Renmoney should combine credit score bands, loan term performance, application channel trends, and predicted repayment probability to strengthen portfolio monitoring and credit decisioning.

Credit portfolio intelligence for sharper lending decisions

This report uses anonymised loan application data to build a practical view of credit risk. It reviews loan distribution, repayment performance, customer risk patterns, credit score behaviour, and repayment probability in a way that connects my MBA analysis to real portfolio monitoring work.

<div class="hero-kpi"><strong>EDA</strong><span>Loan amount and repayment patterns</span></div>
<div class="hero-kpi"><strong>Risk</strong><span>Default by term, channel, gender, and score band</span></div>
<div class="hero-kpi"><strong>Model</strong><span>Logistic regression for repayment probability</span></div>

2 2. Professional Disclosure

I work as a Product Manager at Renmoney Microfinance Bank, where I lead the Offline Lending channels. This case study connects directly to my day-to-day work because portfolio quality, repayment behaviour, customer risk segmentation, and credit decisioning all affect lending performance, profitability, risk management, and most importantly Renmoney’s P/L, which I am directly responsible for through my vertical’s contribution.

Exploratory Data Analysis. EDA is relevant to my work because a lending team needs to understand the quality and shape of portfolio data before using it for decisions. For this analysis, that means checking loan sizes, repayment outcomes, missing fields, unusual values, and patterns that may point to policy, process, or monitoring issues.

Data Visualisation. Visualisation matters because credit discussions are easier when the patterns are clear. Charts showing default rates by loan term, application channel, gender, and credit score band help managers quickly see where risk may be concentrated without working through raw tables.

Hypothesis Testing. Hypothesis testing helps separate meaningful portfolio differences from random variation. In this case, it helps check whether differences across loan terms or application channels are strong enough to influence business review.

Correlation Analysis. Correlation analysis helps show which numeric variables move together, such as credit score, income, DPD, loan amount, and default. This is useful for spotting repayment or risk signals that may deserve closer monitoring.

Logistic Regression. Logistic regression fits the business problem because the outcome is binary: a loan is either repaid or defaulted. The model gives a repayment probability that can support customer risk ranking, portfolio monitoring, and credit review.

3 3. Data Collection & Sampling

The dataset used for this case study came from Renmoney’s internal loan portfolio records. Before I used it for this analysis, the data had been anonymised. It does not include customer names, phone numbers, account numbers, BVN, addresses, or any other personally identifiable information.

The primary data for this analysis was obtained from my workplace. It represents anonymised loan application and repayment records from the credit portfolio. I cleaned the dataset in RStudio by removing fields that had no analytical value, excluding rows with missing values, and preparing the relevant variables for EDA, visualisation, statistical testing, correlation analysis, and logistic regression.

The sampling frame was a random sample of 1,000 digital loan applications from the internal portfolio. The applications cover the period from 2024-09-04 to 2026-03-22. The outcome variable is is_bad, where 1 indicates a bad or defaulted loan and 0 indicates a good or repaid loan.

The analysis was conducted only for academic purposes, and the dataset was treated as confidential. No personally identifiable customer information was used or disclosed. Because the dataset contains confidential internal business information, the raw file cannot be shared publicly. Evidence of data access and confirmation from the organisation’s Data Analyst is included as a screenshot in the appendix.

Due to internal data protection and confidentiality requirements, the full dataset cannot be attached to this submission or published publicly. Only anonymised analytical outputs, summary tables, charts, and screenshots confirming the data source are included.

Evidence of the data source confirmation is provided in Appendix A.

4 Setup

Code
packages <- c(
  "tidyverse",
  "janitor",
  "skimr",
  "lubridate",
  "broom",
  "car",
  "pROC",
  "caret",
  "forcats",
  "knitr",
  "ggcorrplot",
  "patchwork"
)

installed_packages <- rownames(installed.packages())

for (pkg in packages) {
  if (!(pkg %in% installed_packages)) {
    install.packages(pkg)
  }
}

library(tidyverse)
library(janitor)
library(skimr)
library(lubridate)
library(broom)
library(car)
library(pROC)
library(caret)
library(forcats)
library(knitr)
library(ggcorrplot)
library(patchwork)

theme_set(
  theme_minimal(base_size = 12, base_family = "sans") +
    theme(
      plot.title = element_text(face = "bold", size = 15, color = "#37352f"),
      plot.subtitle = element_text(color = "grey35"),
      axis.title = element_text(face = "bold", color = "#37352f"),
      axis.text = element_text(color = "grey30"),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "grey90")
    )
)

5 Data Loading

This document assumes that dataset_credit.csv is saved in the same RStudio project folder as this Quarto document.

Code
loan_data <- read_csv("dataset_credit.csv")

glimpse(loan_data)
Rows: 1,000
Columns: 14
$ loan_id                 <chr> "8c73c3de-6018-444a-a28c-b1fcac2501dd", "05506…
$ loan_amount             <dbl> 50000, 5000, 50000, 181000, 20000, 50000, 2310…
$ loan_term               <chr> "1MONTHS", "2WEEKS", "1MONTHS", "6MONTHS", "1W…
$ is_bad                  <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1…
$ gender                  <chr> "Male", "Male", "Male", "Male", "Male", "Male"…
$ marital_status          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ most_recent_credit_date <dttm> 2021-11-06, 2024-10-24, 2024-06-01, 2025-02-2…
$ dependants              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ income                  <dbl> 88571.429, NA, NA, NA, 725608.667, 68479.000, …
$ application_channel     <chr> "Android", "Android", "Android", "iOS", "Andro…
$ currentDPD              <dbl> 0, 0, NA, 0, 29, NA, 0, 0, 0, 0, NA, 0, 360, 0…
$ maxDPD                  <dbl> 0, 0, NA, 0, 29, NA, 0, 0, 0, 0, NA, 0, 360, 0…
$ has_gps                 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ credit_score            <dbl> 0.27273934, 0.43935433, 0.30403026, 0.24990837…
Code
dim(loan_data)
[1] 1000   14
Code
names(loan_data)
 [1] "loan_id"                 "loan_amount"            
 [3] "loan_term"               "is_bad"                 
 [5] "gender"                  "marital_status"         
 [7] "most_recent_credit_date" "dependants"             
 [9] "income"                  "application_channel"    
[11] "currentDPD"              "maxDPD"                 
[13] "has_gps"                 "credit_score"           

6 Data Cleaning

The cleaning steps below prepare the portfolio data for analysis. I standardised the column names, removed fields that had no analytical value, converted blank cells to missing values, removed incomplete rows, treated invalid dates, and created the variables needed for the case study.

Code
loan_working <- loan_data %>%
  clean_names() %>%
  select(-any_of("marital_status")) %>%
  mutate(
    across(
      where(is.character),
      ~ na_if(str_squish(.), "")
    ),
    most_recent_credit_date = ymd(most_recent_credit_date),
    most_recent_credit_date = if_else(
      most_recent_credit_date < ymd("2000-01-01"),
      as.Date(NA),
      most_recent_credit_date
    )
  )

missing_before <- loan_working %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "missing_count"
  ) %>%
  arrange(desc(missing_count))

kable(
  missing_before,
  caption = "Missing Values Before Final Row Filtering"
)
Missing Values Before Final Row Filtering
variable missing_count
income 723
gender 236
dependants 211
most_recent_credit_date 148
current_dpd 93
max_dpd 93
loan_id 0
loan_amount 0
loan_term 0
is_bad 0
application_channel 0
has_gps 0
credit_score 0
Code
loan_clean_no_empty <- loan_working %>%
  drop_na() %>%
  mutate(
    repayment_status = if_else(is_bad == 1, "Defaulted", "Repaid"),
    repayment_status = factor(
      repayment_status,
      levels = c("Repaid", "Defaulted")
    ),
    repayment_good = if_else(is_bad == 0, 1, 0),
    loan_term = as.factor(loan_term),
    gender = as.factor(gender),
    application_channel = as.factor(application_channel),
    has_gps = as.factor(has_gps),
    log_loan_amount = log1p(loan_amount)
  )

6.1 Data Quality Issues and Treatment

Code
data_quality_issues <- tibble(
  issue = c(
    "Marital status column had no usable values",
    "Several fields had missing values",
    "Some credit date values were invalid placeholders",
    "Loan amount was skewed"
  ),
  treatment = c(
    "Removed the column because it did not add analytical value",
    "Converted blanks to NA and removed rows with missing cells for a complete case analysis",
    "Converted dates before 2000-01-01 to NA before filtering",
    "Created log_loan_amount to reduce skewness for analysis and hypothesis testing"
  )
)

kable(data_quality_issues, caption = "Data Quality Issues Identified and Handled")
Data Quality Issues Identified and Handled
issue treatment
Marital status column had no usable values Removed the column because it did not add analytical value
Several fields had missing values Converted blanks to NA and removed rows with missing cells for a complete case analysis
Some credit date values were invalid placeholders Converted dates before 2000-01-01 to NA before filtering
Loan amount was skewed Created log_loan_amount to reduce skewness for analysis and hypothesis testing
Code
missing_after <- loan_clean_no_empty %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "missing_count"
  ) %>%
  arrange(desc(missing_count))

kable(
  missing_after,
  caption = "Missing Values After Cleaning"
)
Missing Values After Cleaning
variable missing_count
loan_id 0
loan_amount 0
loan_term 0
is_bad 0
gender 0
most_recent_credit_date 0
dependants 0
income 0
application_channel 0
current_dpd 0
max_dpd 0
has_gps 0
credit_score 0
repayment_status 0
repayment_good 0
log_loan_amount 0
Code
dim(loan_clean_no_empty)
[1] 180  16

7 4. Data Description

Code
skim(loan_clean_no_empty)
Data summary
Name loan_clean_no_empty
Number of rows 180
Number of columns 16
_______________________
Column type frequency:
character 1
Date 1
factor 5
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
loan_id 0 1 36 36 0 180 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
most_recent_credit_date 0 1 2000-01-01 2025-09-09 2024-08-12 167

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
loan_term 0 1 FALSE 6 1MO: 170, 3MO: 4, 1WE: 2, 2MO: 2
gender 0 1 FALSE 2 Mal: 129, Fem: 51
application_channel 0 1 FALSE 2 And: 158, iOS: 22
has_gps 0 1 FALSE 2 0: 157, 1: 23
repayment_status 0 1 FALSE 2 Rep: 107, Def: 73

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
loan_amount 0 1 72746.67 77066.11 2.0e+04 50000.00 50000.00 70000.00 892000.00 ▇▁▁▁▁
is_bad 0 1 0.41 0.49 0.0e+00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
dependants 0 1 0.00 0.00 0.0e+00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
income 0 1 193824.77 289218.53 3.0e-01 35085.71 87285.71 190567.49 1610400.00 ▇▁▁▁▁
current_dpd 0 1 276.11 276.77 0.0e+00 0.00 360.00 360.00 1453.00 ▇▆▃▁▁
max_dpd 0 1 713.56 5703.75 0.0e+00 0.00 360.00 360.00 76700.00 ▇▁▁▁▁
credit_score 0 1 0.42 0.22 6.0e-02 0.23 0.42 0.58 0.92 ▇▇▇▆▂
repayment_good 0 1 0.59 0.49 0.0e+00 0.00 1.00 1.00 1.00 ▆▁▁▁▇
log_loan_amount 0 1 11.03 0.46 9.9e+00 10.82 10.82 11.16 13.70 ▁▇▁▁▁
Code
variable_description <- tibble(
  variable = c(
    "loan_id",
    "loan_amount",
    "loan_term",
    "is_bad",
    "repayment_status",
    "repayment_good",
    "gender",
    "most_recent_credit_date",
    "dependants",
    "income",
    "application_channel",
    "current_dpd",
    "max_dpd",
    "has_gps",
    "credit_score",
    "log_loan_amount"
  ),
  description = c(
    "Anonymised unique loan identifier",
    "Amount borrowed by the customer",
    "Duration of the loan",
    "Outcome variable where 1 means default and 0 means repayment",
    "Readable repayment label: Repaid or Defaulted",
    "Binary repayment variable where 1 means repaid and 0 means defaulted",
    "Applicant gender",
    "Most recent credit activity date",
    "Number of dependants",
    "Applicant income",
    "Application source or channel",
    "Current days past due",
    "Maximum days past due",
    "Whether GPS data is available",
    "Applicant credit score",
    "Log transformed loan amount"
  )
)

kable(variable_description, caption = "Variable Description")
Variable Description
variable description
loan_id Anonymised unique loan identifier
loan_amount Amount borrowed by the customer
loan_term Duration of the loan
is_bad Outcome variable where 1 means default and 0 means repayment
repayment_status Readable repayment label: Repaid or Defaulted
repayment_good Binary repayment variable where 1 means repaid and 0 means defaulted
gender Applicant gender
most_recent_credit_date Most recent credit activity date
dependants Number of dependants
income Applicant income
application_channel Application source or channel
current_dpd Current days past due
max_dpd Maximum days past due
has_gps Whether GPS data is available
credit_score Applicant credit score
log_loan_amount Log transformed loan amount
Code
portfolio_summary <- loan_clean_no_empty %>%
  summarise(
    original_sample_size = nrow(loan_data),
    cleaned_sample_size = n(),
    total_loan_value = sum(loan_amount),
    average_loan_amount = mean(loan_amount),
    median_loan_amount = median(loan_amount),
    minimum_loan_amount = min(loan_amount),
    maximum_loan_amount = max(loan_amount),
    average_income = mean(income),
    average_credit_score = mean(credit_score),
    total_defaulted = sum(is_bad == 1),
    total_repaid = sum(is_bad == 0),
    default_rate = mean(is_bad == 1) * 100,
    repayment_rate = mean(is_bad == 0) * 100
  )

kable(
  portfolio_summary,
  digits = 2,
  caption = "Portfolio Summary"
)
Portfolio Summary
original_sample_size cleaned_sample_size total_loan_value average_loan_amount median_loan_amount minimum_loan_amount maximum_loan_amount average_income average_credit_score total_defaulted total_repaid default_rate repayment_rate
1000 180 13094400 72746.67 50000 20000 892000 193824.8 0.42 73 107 40.56 59.44

8 5. Technique 1: Exploratory Data Analysis

8.1 Theory Recap in This Case

Exploratory Data Analysis is used here as the first credit portfolio check before any formal testing or modelling. In this Renmoney sample, the goal is not only to describe the data, but to confirm whether the cleaned dataset is reliable enough to support a repayment and default risk discussion.

This means checking the size of the usable sample, the spread of loan amounts, the repayment/default split, missing value issues, date quality, and whether outliers may distort the portfolio view. These checks matter because a credit decision based on unclean or misunderstood data can lead to poor segmentation, weak monitoring, or misleading portfolio conclusions.

8.2 Business Justification

In my role, EDA supports the kind of first-level portfolio review that happens before a product or risk decision is made. If the cleaned data shows a high default share, unusual loan concentration, or major data quality problems, the business should pause before drawing conclusions from later tests or models. This makes EDA a practical control step, not just a descriptive exercise.

Code
loan_amount_summary <- loan_clean_no_empty %>%
  summarise(
    mean_loan = mean(loan_amount),
    median_loan = median(loan_amount),
    min_loan = min(loan_amount),
    max_loan = max(loan_amount),
    standard_deviation = sd(loan_amount),
    q1 = quantile(loan_amount, 0.25),
    q3 = quantile(loan_amount, 0.75),
    iqr = IQR(loan_amount)
  )

kable(
  loan_amount_summary,
  digits = 2,
  caption = "Loan Amount Summary"
)
Loan Amount Summary
mean_loan median_loan min_loan max_loan standard_deviation q1 q3 iqr
72746.67 50000 20000 892000 77066.11 50000 70000 20000
Code
repayment_summary <- loan_clean_no_empty %>%
  count(repayment_status) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 2)
  )

kable(
  repayment_summary,
  caption = "Repayment Status Summary"
)
Repayment Status Summary
repayment_status n percentage
Repaid 107 59.44
Defaulted 73 40.56

8.3 Outlier Detection

Code
loan_outlier_limits <- loan_amount_summary %>%
  mutate(
    lower_limit = q1 - 1.5 * iqr,
    upper_limit = q3 + 1.5 * iqr
  ) %>%
  select(lower_limit, upper_limit)

loan_outliers <- loan_clean_no_empty %>%
  filter(
    loan_amount < loan_outlier_limits$lower_limit |
      loan_amount > loan_outlier_limits$upper_limit
  )

outlier_summary <- tibble(
  number_of_outliers = nrow(loan_outliers),
  percentage_of_cleaned_data = round(nrow(loan_outliers) / nrow(loan_clean_no_empty) * 100, 2)
)

kable(outlier_summary, caption = "Loan Amount Outlier Summary")
Loan Amount Outlier Summary
number_of_outliers percentage_of_cleaned_data
15 8.33

8.4 Interpretation

The EDA shows that the cleaned dataset is still useful for the case study and for a basic portfolio review. The loan amount summary helps show the typical ticket size in the sample, while the repayment summary shows the split between repaid and defaulted loans. The outlier check is important because a few unusually large loans can pull averages upward and make the portfolio appear more exposed than it really is.

This section helps separate two things: what the portfolio looks like on the surface and whether the data is clean enough to support the next stage of statistical analysis.

9 6. Technique 2: Data Visualisation

9.1 Theory Recap in This Case

Data visualisation is used here to turn the cleaned loan records into a clear portfolio story.

The histogram is used because loan amount is a numeric variable and needs a distribution view. Bar charts are used because repayment status, loan term, application channel, and credit score bands are group comparisons. This makes the analysis easier to interpret without reading through the full dataset.

9.2 Business Justification

For Renmoney, visualisation helps move the discussion from raw tables to risk patterns. If a particular loan term, application channel, or score band shows a higher default rate, the business can decide where to investigate first. The charts do not prove cause and effect, but they help identify where portfolio review and risk monitoring should focus.

Code
default_by_term <- loan_clean_no_empty %>%
  group_by(loan_term) %>%
  summarise(
    total_loans = n(),
    defaulted_loans = sum(is_bad == 1),
    repaid_loans = sum(is_bad == 0),
    default_rate = round(mean(is_bad == 1) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(default_rate))

default_by_channel <- loan_clean_no_empty %>%
  group_by(application_channel) %>%
  summarise(
    total_loans = n(),
    defaulted_loans = sum(is_bad == 1),
    repaid_loans = sum(is_bad == 0),
    default_rate = round(mean(is_bad == 1) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(default_rate))

default_by_gender <- loan_clean_no_empty %>%
  group_by(gender) %>%
  summarise(
    total_loans = n(),
    defaulted_loans = sum(is_bad == 1),
    repaid_loans = sum(is_bad == 0),
    default_rate = round(mean(is_bad == 1) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(default_rate))

loan_clean_no_empty <- loan_clean_no_empty %>%
  mutate(
    credit_score_band = ntile(credit_score, 5),
    credit_score_band = factor(
      credit_score_band,
      labels = c("Band 1", "Band 2", "Band 3", "Band 4", "Band 5")
    )
  )

default_by_score_band <- loan_clean_no_empty %>%
  group_by(credit_score_band) %>%
  summarise(
    total_loans = n(),
    average_credit_score = round(mean(credit_score), 4),
    defaulted_loans = sum(is_bad == 1),
    repaid_loans = sum(is_bad == 0),
    default_rate = round(mean(is_bad == 1) * 100, 2),
    .groups = "drop"
  )

kable(default_by_term, caption = "Default Rate by Loan Term")
Default Rate by Loan Term
loan_term total_loans defaulted_loans repaid_loans default_rate
2MONTHS 2 2 0 100.00
4MONTHS 1 1 0 100.00
6MONTHS 1 1 0 100.00
1MONTHS 170 69 101 40.59
1WEEKS 2 0 2 0.00
3MONTHS 4 0 4 0.00
Code
kable(default_by_channel, caption = "Default Rate by Application Channel")
Default Rate by Application Channel
application_channel total_loans defaulted_loans repaid_loans default_rate
Android 158 68 90 43.04
iOS 22 5 17 22.73
Code
kable(default_by_gender, caption = "Default Rate by Gender")
Default Rate by Gender
gender total_loans defaulted_loans repaid_loans default_rate
Male 129 55 74 42.64
Female 51 18 33 35.29
Code
kable(default_by_score_band, caption = "Default Rate by Credit Score Band")
Default Rate by Credit Score Band
credit_score_band total_loans average_credit_score defaulted_loans repaid_loans default_rate
Band 1 36 0.1333 3 33 8.33
Band 2 36 0.2622 8 28 22.22
Band 3 36 0.4097 9 27 25.00
Band 4 36 0.5549 22 14 61.11
Band 5 36 0.7313 31 5 86.11

9.3 Visualisation Narrative

The five plots below tell one story: how the credit portfolio is distributed, how loans perform, and where default risk is concentrated.

Code
p1 <- ggplot(loan_clean_no_empty, aes(x = loan_amount)) +
  geom_histogram(bins = 30, fill = "#37352f", alpha = 0.85) +
  labs(
    title = "Loan Amount Distribution",
    subtitle = "Shows how credit exposure is spread",
    x = "Loan Amount",
    y = "Number of Customers"
  )

p2 <- ggplot(repayment_summary, aes(x = repayment_status, y = n)) +
  geom_col(fill = "#37352f", alpha = 0.9) +
  labs(
    title = "Repayment Status",
    subtitle = "Compares repaid and defaulted loans",
    x = "Repayment Status",
    y = "Number of Loans"
  )

p3 <- ggplot(default_by_term, aes(x = reorder(loan_term, default_rate), y = default_rate)) +
  geom_col(fill = "#37352f", alpha = 0.9) +
  coord_flip() +
  labs(
    title = "Default Rate by Loan Term",
    subtitle = "Highlights tenor based risk differences",
    x = "Loan Term",
    y = "Default Rate (%)"
  )

p4 <- ggplot(default_by_channel, aes(x = application_channel, y = default_rate)) +
  geom_col(fill = "#37352f", alpha = 0.9) +
  labs(
    title = "Default Rate by Channel",
    subtitle = "Compares application source risk",
    x = "Application Channel",
    y = "Default Rate (%)"
  )

p5 <- ggplot(default_by_score_band, aes(x = credit_score_band, y = default_rate)) +
  geom_col(fill = "#37352f", alpha = 0.9) +
  labs(
    title = "Default Rate by Credit Score Band",
    subtitle = "Shows risk movement across score segments",
    x = "Credit Score Band",
    y = "Default Rate (%)"
  )

(p1 | p2) / (p3 | p4) / p5 +
  plot_annotation(
    title = "Credit Portfolio Visualisation Narrative",
    subtitle = "From portfolio distribution to default concentration"
  )

9.4 Interpretation

The charts make the portfolio story easier to follow because they move from broad exposure to specific risk segments. The loan amount chart shows how credit is distributed, the repayment chart shows the balance between good and bad loans, and the remaining charts show where default risk appears to be higher.

For a manager, the value is that the charts show where to start asking better credit questions. A higher default rate in a loan term, application channel, or score band does not automatically mean the bank should stop lending there. It means the segment deserves closer review to understand whether the issue is customer mix, affordability, onboarding quality, monitoring, or repayment behaviour.

10 7. Technique 3: Hypothesis Testing

10.1 Theory Recap in This Case

Hypothesis testing is used here to check whether selected portfolio differences are strong enough to take seriously. The original case study required a test of whether interest rates differ across loan categories, but this dataset does not include interest rate or loan category. Because of that limitation, I used a substitute test that still fits the credit context: whether log loan amount differs across loan terms.

The second test checks whether repayment status is associated with application channel. This matters because channel performance can reflect customer mix, onboarding quality, fraud controls, or post-disbursement engagement.

10.2 Business Justification

Hypothesis testing helps avoid overreacting to visible differences in charts. A chart may show that one channel or loan term has a different default pattern, but the test helps decide whether the pattern is statistically meaningful. The effect size adds another layer by showing whether the difference is large enough to matter for business review.

10.3 Hypothesis Test 1: Do Loan Amounts Differ by Loan Term?

H0: Average log loan amount does not differ across loan terms.

H1: Average log loan amount differs across loan terms.

A log transformation is used because loan amount is skewed.

Code
log_loan_amount_term_test <- aov(
  log_loan_amount ~ loan_term,
  data = loan_clean_no_empty
)

summary(log_loan_amount_term_test)
             Df Sum Sq Mean Sq F value Pr(>F)    
loan_term     5  20.47   4.094   39.72 <2e-16 ***
Residuals   174  17.94   0.103                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova_result <- tidy(log_loan_amount_term_test)

kable(
  anova_result,
  digits = 4,
  caption = "ANOVA: Log Loan Amount by Loan Term"
)
ANOVA: Log Loan Amount by Loan Term
term df sumsq meansq statistic p.value
loan_term 5 20.4714 4.0943 39.7191 0
Residuals 174 17.9360 0.1031 NA NA

10.3.1 Assumption Checks

Code
anova_residuals <- residuals(log_loan_amount_term_test)

shapiro_result <- shapiro.test(anova_residuals)

levene_result <- car::leveneTest(
  log_loan_amount ~ loan_term,
  data = loan_clean_no_empty
)

shapiro_result

    Shapiro-Wilk normality test

data:  anova_residuals
W = 0.71788, p-value < 2.2e-16
Code
levene_result

10.3.2 Effect Size

Code
anova_table <- anova(log_loan_amount_term_test)

eta_squared_manual <- anova_table[["Sum Sq"]][1] /
  sum(anova_table[["Sum Sq"]], na.rm = TRUE)

eta_squared_table <- tibble(
  effect_size = "Eta squared",
  value = eta_squared_manual
)

kable(
  eta_squared_table,
  digits = 4,
  caption = "Effect Size for ANOVA"
)
Effect Size for ANOVA
effect_size value
Eta squared 0.533

10.4 Hypothesis Test 2: Is Repayment Status Associated with Application Channel?

H0: Repayment status is independent of application channel.

H1: Repayment status is associated with application channel.

Code
channel_default_table <- table(
  loan_clean_no_empty$application_channel,
  loan_clean_no_empty$repayment_status
)

channel_default_table
         
          Repaid Defaulted
  Android     90        68
  iOS         17         5
Code
channel_chi_square <- chisq.test(
  channel_default_table,
  simulate.p.value = TRUE,
  B = 10000
)

channel_chi_square

    Pearson's Chi-squared test with simulated p-value (based on 10000
    replicates)

data:  channel_default_table
X-squared = 3.3044, df = NA, p-value = 0.1005
Code
tidy(channel_chi_square) %>%
  kable(
    digits = 4,
    caption = "Chi Square Test: Repayment Status by Application Channel"
  )
Chi Square Test: Repayment Status by Application Channel
statistic p.value parameter method
3.3044 0.1005 NA Pearson’s Chi-squared test with simulated p-value
(based on 10000 replicates)

10.4.1 Assumption Check and Effect Size

Code
expected_channel_counts <- suppressWarnings(
  chisq.test(channel_default_table)$expected
)

expected_channel_counts
         
            Repaid Defaulted
  Android 93.92222 64.077778
  iOS     13.07778  8.922222
Code
cramers_v_manual <- function(tbl) {
  chi_value <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  n <- sum(tbl)
  min_dimension <- min(nrow(tbl), ncol(tbl))
  sqrt(as.numeric(chi_value) / (n * (min_dimension - 1)))
}

channel_cramers_v <- tibble(
  effect_size = "Cramer's V",
  value = cramers_v_manual(channel_default_table)
)

kable(
  channel_cramers_v,
  digits = 4,
  caption = "Effect Size for Chi Square Test"
)
Effect Size for Chi Square Test
effect_size value
Cramer’s V 0.1355

10.5 Additional Test: Is Repayment Status Associated with Loan Term?

Code
loan_term_default_table <- table(
  loan_clean_no_empty$loan_term,
  loan_clean_no_empty$repayment_status
)

loan_term_chi_square <- chisq.test(
  loan_term_default_table,
  simulate.p.value = TRUE,
  B = 10000
)

loan_term_chi_square

    Pearson's Chi-squared test with simulated p-value (based on 10000
    replicates)

data:  loan_term_default_table
X-squared = 9.9565, df = NA, p-value = 0.0215
Code
tidy(loan_term_chi_square) %>%
  kable(
    digits = 4,
    caption = "Chi Square Test: Repayment Status by Loan Term"
  )
Chi Square Test: Repayment Status by Loan Term
statistic p.value parameter method
9.9565 0.0215 NA Pearson’s Chi-squared test with simulated p-value
(based on 10000 replicates)
Code
loan_term_cramers_v <- tibble(
  effect_size = "Cramer's V",
  value = cramers_v_manual(loan_term_default_table)
)

kable(
  loan_term_cramers_v,
  digits = 4,
  caption = "Effect Size for Loan Term Chi Square Test"
)
Effect Size for Loan Term Chi Square Test
effect_size value
Cramer’s V 0.2352

10.6 Interpretation

The ANOVA test helps show whether loan size differs meaningfully across loan terms after adjusting for skewness with a log transformation. If the p value is below 0.05, the bank should not assume that all loan terms carry the same exposure pattern. Some tenors may be linked with larger loan amounts and may need closer portfolio monitoring.

The chi-square tests help show whether repayment outcome is associated with application channel or loan term. If a result is significant, the next business step is not an automatic policy change. It is a focused investigation into why the difference exists, such as customer profile, onboarding process, fraud checks, affordability quality, or repayment engagement.

11 8. Technique 4: Correlation Analysis

11.1 Theory Recap in This Case

Correlation analysis is used here to understand how the numeric credit variables move together in the cleaned portfolio. The variables reviewed include loan amount, income, current DPD, maximum DPD, credit score, default status, and repayment status. These variables are important because they relate directly to exposure, affordability, repayment stress, and credit risk.

The most important business relationship is the link between credit score and default status. If credit score has a meaningful relationship with default, it supports the use of score bands for risk segmentation. If the relationship is weak, it suggests that credit score should be combined with other signals rather than used alone.

11.2 Business Justification

For Renmoney, the correlation heatmap helps identify variables that may be useful for portfolio monitoring and model building. It also shows where variables may overlap. For example, current DPD and maximum DPD may carry similar information, which is useful for monitoring but may create interpretation issues if both are used in the same model without care.

11.3 Credit Score and Default Correlation

Code
credit_score_default_correlation <- cor.test(
  loan_clean_no_empty$credit_score,
  loan_clean_no_empty$is_bad,
  method = "pearson"
)

credit_score_default_correlation

    Pearson's product-moment correlation

data:  loan_clean_no_empty$credit_score and loan_clean_no_empty$is_bad
t = 9.8089, df = 178, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4883942 0.6797177
sample estimates:
      cor 
0.5923439 

11.4 Spearman Correlation Check

Code
credit_score_default_spearman <- cor.test(
  loan_clean_no_empty$credit_score,
  loan_clean_no_empty$is_bad,
  method = "spearman"
)

credit_score_default_spearman

    Spearman's rank correlation rho

data:  loan_clean_no_empty$credit_score and loan_clean_no_empty$is_bad
S = 398920, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.5895762 

11.5 Correlation Matrix Heatmap

Code
numeric_vars <- loan_clean_no_empty %>%
  select(
    loan_amount,
    log_loan_amount,
    income,
    current_dpd,
    max_dpd,
    credit_score,
    is_bad,
    repayment_good
  )

cor_matrix <- cor(
  numeric_vars,
  use = "complete.obs",
  method = "pearson"
)

ggcorrplot(
  cor_matrix,
  type = "lower",
  lab = TRUE,
  title = "Correlation Matrix of Numeric Loan Variables",
  colors = c("#b8dff2", "#ffffff", "#37352f")
)

Code
cor_pairs <- as.data.frame(as.table(cor_matrix)) %>%
  rename(
    variable_1 = Var1,
    variable_2 = Var2,
    correlation = Freq
  ) %>%
  filter(as.character(variable_1) < as.character(variable_2)) %>%
  mutate(abs_correlation = abs(correlation)) %>%
  arrange(desc(abs_correlation))

kable(
  head(cor_pairs, 5),
  digits = 4,
  caption = "Strongest Numeric Correlations"
)
Strongest Numeric Correlations
variable_1 variable_2 correlation abs_correlation
is_bad repayment_good -1.0000 1.0000
loan_amount log_loan_amount 0.8579 0.8579
credit_score is_bad 0.5923 0.5923
credit_score repayment_good -0.5923 0.5923
credit_score income -0.3253 0.3253

11.6 Interpretation

The heatmap helps show which variables are giving similar or related information about the loan portfolio. Strong relationships around current_dpd, max_dpd, is_bad, and repayment_good are expected because delinquency and default are part of loan performance. The more useful business question is whether application-stage variables such as credit score, income, and loan amount show enough relationship with default to support early risk segmentation.

This section also shows why timing matters. DPD variables may be strongly connected to default, but they are usually known after the loan has started performing. That makes them more suitable for portfolio monitoring than for initial approval decisions. For application-stage decisioning, the bank would need to focus more on variables available before disbursement.

12 9. Technique 5: Logistic Regression

12.1 Theory Recap in This Case

Logistic regression is used because the repayment outcome in this case is binary: each loan is either repaid or defaulted. The model estimates the probability of repayment using applicant and loan characteristics in the cleaned dataset. The dependent variable is repayment_good, where 1 means the loan was repaid and 0 means the loan defaulted.

In this analysis, the model includes loan amount, credit score, loan term, gender, income, application channel, DPD measures, and GPS availability. The goal is not to build a final production model, but to show how portfolio data can be translated into a repayment probability view.

12.2 Business Justification

The model is useful because it moves the analysis from group-level summaries to customer-level risk ranking. A repayment probability score can help identify accounts that may need closer review, stronger monitoring, or additional affordability checks. However, the model should remain a decision support tool until it is validated on separate data and reviewed for fairness and credit policy alignment.

Code
# dependants is excluded because it has only one value after cleaning.
# Including a constant variable causes aliased coefficients and breaks VIF.

repayment_model <- glm(
  repayment_good ~ loan_amount +
    credit_score +
    loan_term +
    gender +
    income +
    application_channel +
    current_dpd +
    max_dpd +
    has_gps,
  data = loan_clean_no_empty,
  family = binomial
)

summary(repayment_model)

Call:
glm(formula = repayment_good ~ loan_amount + credit_score + loan_term + 
    gender + income + application_channel + current_dpd + max_dpd + 
    has_gps, family = binomial, data = loan_clean_no_empty)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             4.376e+00  8.910e-01   4.912 9.03e-07 ***
loan_amount            -1.992e-06  7.792e-06  -0.256    0.798    
credit_score           -8.031e+00  1.267e+00  -6.336 2.35e-10 ***
loan_term1WEEKS         1.429e+01  2.713e+03   0.005    0.996    
loan_term2MONTHS       -1.831e+01  2.681e+03  -0.007    0.995    
loan_term3MONTHS        1.589e+01  1.766e+03   0.009    0.993    
loan_term4MONTHS       -1.922e+01  3.956e+03  -0.005    0.996    
loan_term6MONTHS       -1.838e+01  3.956e+03  -0.005    0.996    
genderMale             -3.577e-01  4.599e-01  -0.778    0.437    
income                 -4.146e-08  8.728e-07  -0.047    0.962    
application_channeliOS  3.985e-01  6.688e-01   0.596    0.551    
current_dpd            -2.589e-04  7.932e-04  -0.326    0.744    
max_dpd                 4.279e-05  1.351e-04   0.317    0.751    
has_gps1                4.728e-01  6.932e-01   0.682    0.495    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 243.07  on 179  degrees of freedom
Residual deviance: 153.07  on 166  degrees of freedom
AIC: 181.07

Number of Fisher Scoring iterations: 16

12.3 Regression Table With Odds Ratios

Code
regression_results <- tidy(
  repayment_model,
  exponentiate = TRUE,
  conf.int = TRUE
)

kable(
  regression_results,
  digits = 4,
  caption = "Logistic Regression Results With Odds Ratios"
)
Logistic Regression Results With Odds Ratios
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 79.5341 0.8910 4.9117 0.0000 15.2162 5.133196e+02
loan_amount 1.0000 0.0000 -0.2556 0.7983 1.0000 1.000000e+00
credit_score 0.0003 1.2675 -6.3363 0.0000 0.0000 3.300000e-03
loan_term1WEEKS 1602331.0398 2712.7456 0.0053 0.9958 0.0000 NA
loan_term2MONTHS 0.0000 2681.0769 -0.0068 0.9946 NA 4.079211e+171
loan_term3MONTHS 7943397.9843 1766.3128 0.0090 0.9928 0.0000 NA
loan_term4MONTHS 0.0000 3956.1811 -0.0049 0.9961 NA Inf
loan_term6MONTHS 0.0000 3956.1858 -0.0046 0.9963 NA 5.238443e+152
genderMale 0.6993 0.4599 -0.7778 0.4367 0.2799 1.719400e+00
income 1.0000 0.0000 -0.0475 0.9621 1.0000 1.000000e+00
application_channeliOS 1.4896 0.6688 0.5958 0.5513 0.4261 6.178500e+00
current_dpd 0.9997 0.0008 -0.3265 0.7441 0.9972 1.001300e+00
max_dpd 1.0000 0.0001 0.3167 0.7515 1.0000 NA
has_gps1 1.6045 0.6932 0.6821 0.4952 0.4107 6.366100e+00

12.4 Significant Coefficients and Business Actions

Code
coefficient_actions <- regression_results %>%
  filter(term != "(Intercept)", p.value < 0.05) %>%
  mutate(
    direction = if_else(estimate > 1, "increases", "reduces"),
    business_action = paste0(
      "The variable ", term, " ", direction,
      " the odds of repayment. This should be reviewed as a possible risk signal for credit monitoring and decision support."
    )
  ) %>%
  select(term, estimate, p.value, direction, business_action)

if (nrow(coefficient_actions) == 0) {
  coefficient_actions <- tibble(
    term = "No statistically significant predictor at 5 percent level",
    estimate = NA_real_,
    p.value = NA_real_,
    direction = "Not applicable",
    business_action = "Use the model cautiously and validate with additional data before applying it to credit decisions."
  )
}

kable(
  coefficient_actions,
  digits = 4,
  caption = "Significant Coefficients and Business Actions"
)
Significant Coefficients and Business Actions
term estimate p.value direction business_action
credit_score 3e-04 0 reduces The variable credit_score reduces the odds of repayment. This should be reviewed as a possible risk signal for credit monitoring and decision support.

12.5 Multicollinearity Check

Code
alias_check <- alias(repayment_model)

alias_check
Model :
repayment_good ~ loan_amount + credit_score + loan_term + gender + 
    income + application_channel + current_dpd + max_dpd + has_gps
Code
vif_results <- tryCatch(
  {
    vif(repayment_model)
  },
  error = function(e) {
    message("VIF could not be calculated because the model has aliased coefficients.")
    message("This usually happens when a variable is constant or perfectly explained by another variable.")
    return(NULL)
  }
)

vif_results
                        GVIF Df GVIF^(1/(2*Df))
loan_amount         1.038804  1        1.019217
credit_score        1.095703  1        1.046758
loan_term           1.000005  5        1.000000
gender              1.055102  1        1.027181
income              1.065431  1        1.032197
application_channel 1.074471  1        1.036567
current_dpd         1.128359  1        1.062243
max_dpd             1.052975  1        1.026146
has_gps             1.053909  1        1.026601

12.6 Predicted Repayment Probability

Code
loan_clean_no_empty <- loan_clean_no_empty %>%
  mutate(
    predicted_repayment_probability = predict(
      repayment_model,
      type = "response"
    )
  )

loan_clean_no_empty %>%
  select(
    loan_id,
    repayment_status,
    repayment_good,
    predicted_repayment_probability
  ) %>%
  head(20) %>%
  kable(
    digits = 4,
    caption = "Sample Predicted Repayment Probabilities"
  )
Sample Predicted Repayment Probabilities
loan_id repayment_status repayment_good predicted_repayment_probability
8c73c3de-6018-444a-a28c-b1fcac2501dd Repaid 1 0.8488
687b3bc4-32d4-41e2-af39-94438d4ef88d Repaid 1 1.0000
10c36c2a-9345-47ae-96ca-e8cec04c5b9a Defaulted 0 0.3163
05d35da4-8282-4406-a644-0aa31cbc30b9 Defaulted 0 0.4654
2d6f8dab-9443-426a-8869-d3e9841c4f7c Defaulted 0 0.4666
dfaffd61-8ccd-4a63-ba47-6bf9f669f9b1 Repaid 1 1.0000
7ad79cd7-790c-461d-97be-d57ced8f9791 Defaulted 0 0.2991
41f3cef4-a66e-4e24-b45f-b34e1a18ebd5 Repaid 1 0.7819
357422b3-ba50-4ea7-9d87-3c00b3551226 Defaulted 0 0.7826
7784f590-0048-4960-9eda-fce6e1ddb951 Repaid 1 0.9159
f01eb3de-48a6-45fa-8051-cf9da587ea13 Repaid 1 0.5295
6f690a9e-8695-44dd-b15b-58e59d92c509 Defaulted 0 0.1196
ac31d581-0cb3-4f5e-ba80-951da571d100 Defaulted 0 0.1585
f40f4e0a-29e8-4e40-8def-25dc8821ed12 Defaulted 0 0.2382
f8caf6da-9aa6-4ead-8444-dd065db024ae Repaid 1 0.8339
980fe0c0-49b4-4dc7-91b9-169804a96dcf Repaid 1 0.8899
4bb62fdf-0cb0-4c93-b840-bcb0cae4ca87 Repaid 1 0.3885
1f71448c-c016-4131-b127-ab47af9694a3 Repaid 1 0.9474
324be21b-ab0b-4db8-ba4f-dca8ec85c1be Defaulted 0 0.0256
5bd26950-1f7e-4e39-a19d-22133f2e128d Defaulted 0 0.6226
Code
ggplot(loan_clean_no_empty, aes(x = predicted_repayment_probability)) +
  geom_histogram(bins = 30, fill = "#37352f", alpha = 0.85) +
  labs(
    title = "Predicted Repayment Probability",
    subtitle = "Distribution of model estimated repayment probabilities",
    x = "Predicted Probability of Repayment",
    y = "Number of Customers"
  )

12.7 Model Evaluation

Code
loan_clean_no_empty <- loan_clean_no_empty %>%
  mutate(
    predicted_class = if_else(
      predicted_repayment_probability >= 0.5,
      1,
      0
    )
  )

confusionMatrix(
  factor(loan_clean_no_empty$predicted_class, levels = c(0, 1)),
  factor(loan_clean_no_empty$repayment_good, levels = c(0, 1)),
  positive = "1"
)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 56 17
         1 17 90
                                          
               Accuracy : 0.8111          
                 95% CI : (0.7462, 0.8655)
    No Information Rate : 0.5944          
    P-Value [Acc > NIR] : 4.292e-10       
                                          
                  Kappa : 0.6082          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8411          
            Specificity : 0.7671          
         Pos Pred Value : 0.8411          
         Neg Pred Value : 0.7671          
             Prevalence : 0.5944          
         Detection Rate : 0.5000          
   Detection Prevalence : 0.5944          
      Balanced Accuracy : 0.8041          
                                          
       'Positive' Class : 1               
                                          
Code
roc_model <- roc(
  response = loan_clean_no_empty$repayment_good,
  predictor = loan_clean_no_empty$predicted_repayment_probability
)

plot(
  roc_model,
  col = "#37352f",
  main = "ROC Curve for Repayment Probability Model"
)

Code
auc(roc_model)
Area under the curve: 0.8793

12.8 Regression Diagnostic Plots

Code
loan_clean_no_empty <- loan_clean_no_empty %>%
  mutate(
    deviance_residuals = residuals(repayment_model, type = "deviance"),
    pearson_residuals = residuals(repayment_model, type = "pearson"),
    predicted_decile = ntile(predicted_repayment_probability, 10)
  )

calibration_table <- loan_clean_no_empty %>%
  group_by(predicted_decile) %>%
  summarise(
    average_predicted_repayment = mean(predicted_repayment_probability),
    actual_repayment_rate = mean(repayment_good),
    total_loans = n(),
    .groups = "drop"
  )

kable(
  calibration_table,
  digits = 4,
  caption = "Calibration Table by Predicted Probability Decile"
)
Calibration Table by Predicted Probability Decile
predicted_decile average_predicted_repayment actual_repayment_rate total_loans
1 0.0651 0.0000 18
2 0.1701 0.2778 18
3 0.2955 0.1667 18
4 0.4194 0.4444 18
5 0.5775 0.6667 18
6 0.7492 0.8333 18
7 0.8497 0.7222 18
8 0.8971 0.8889 18
9 0.9426 0.9444 18
10 0.9782 1.0000 18
Code
residual_plot <- ggplot(
  loan_clean_no_empty,
  aes(x = predicted_repayment_probability, y = deviance_residuals)
) +
  geom_point(alpha = 0.6, color = "#37352f") +
  geom_hline(yintercept = 0) +
  labs(
    title = "Residuals Versus Predicted Probability",
    x = "Predicted Repayment Probability",
    y = "Deviance Residuals"
  )

calibration_plot <- ggplot(
  calibration_table,
  aes(x = average_predicted_repayment, y = actual_repayment_rate)
) +
  geom_point(size = 3, color = "#37352f") +
  geom_abline(slope = 1, intercept = 0) +
  labs(
    title = "Calibration Plot",
    x = "Average Predicted Repayment",
    y = "Actual Repayment Rate"
  )

residual_plot | calibration_plot

12.9 Risk Ranking

Code
risk_ranking <- loan_clean_no_empty %>%
  select(
    loan_id,
    loan_amount,
    loan_term,
    income,
    credit_score,
    repayment_status,
    predicted_repayment_probability
  ) %>%
  arrange(predicted_repayment_probability)

risk_ranking %>%
  head(20) %>%
  kable(
    digits = 4,
    caption = "Top 20 Highest Risk Customers Based on Predicted Repayment Probability"
  )
Top 20 Highest Risk Customers Based on Predicted Repayment Probability
loan_id loan_amount loan_term income credit_score repayment_status predicted_repayment_probability
90450648-bb3e-421a-af42-e9743a77b4f2 300000 2MONTHS 586000.0000 0.5043 Defaulted 0.0000
0d70b559-0dc0-4389-ae92-f17cb4865e20 892000 6MONTHS 519714.2857 0.1755 Defaulted 0.0000
9bc5a0f6-ae4b-467d-870e-8ebb867c8130 300000 4MONTHS 1503841.0000 0.2617 Defaulted 0.0000
4895e841-a23a-4d87-a04d-30e2581db45b 50000 2MONTHS 451000.0000 0.3213 Defaulted 0.0000
324be21b-ab0b-4db8-ba4f-dca8ec85c1be 100000 1MONTHS 5.7400 0.9190 Defaulted 0.0256
8ee579b3-b4ee-4eb1-9d57-f407582fd89b 99400 1MONTHS 9333.3333 0.9200 Defaulted 0.0360
4166f3c2-c4e5-4c57-8e30-0a928cb0cfc4 50000 1MONTHS 19333.3333 0.8777 Defaulted 0.0567
4cc103e4-1173-4c00-bf11-0cee17580d98 50000 1MONTHS 1.3067 0.8701 Defaulted 0.0579
64889171-cc75-49ea-8375-aff847575479 62000 1MONTHS 18428.5714 0.7647 Defaulted 0.0833
c33b7d71-1a00-4180-a334-90b4f9d00e4b 150000 1MONTHS 72571.4286 0.7597 Defaulted 0.0842
77d12cb6-2e26-4559-8e21-b265091abbbb 54500 1MONTHS 123134.2857 0.7728 Defaulted 0.0848
ff650769-aa10-4c45-93b4-d767d975429e 50000 1MONTHS 10050.0000 0.7554 Defaulted 0.0912
d9a197b5-4357-4b2b-a7bc-c0be6d76f48c 46000 1MONTHS 8531.5000 0.7733 Defaulted 0.0914
7e250591-9a20-4a9a-b5ed-3dd9aae5eb33 70000 1MONTHS 274104.9629 0.8074 Defaulted 0.0978
27858b61-70d8-4aba-8b61-47a8d0e0475f 70000 1MONTHS 66880.0000 0.8104 Defaulted 0.0983
6f690a9e-8695-44dd-b15b-58e59d92c509 46600 1MONTHS 15823.3333 0.7276 Defaulted 0.1196
5b268633-4900-4e1f-9478-9c8a4485219a 45300 1MONTHS 43400.0000 0.7713 Defaulted 0.1205
5ad9af5c-1acb-4538-a153-a4f5f746b364 50000 1MONTHS 54766.6667 0.7216 Defaulted 0.1238
82dc9cc1-d8d0-4c6b-a2b9-ce6f85c97e3e 83000 1MONTHS 86000.0000 0.7073 Defaulted 0.1292
d5568bf9-5e28-45a7-a45b-851fd853dbdd 50000 1MONTHS 8159.6214 0.7674 Defaulted 0.1359

12.10 Interpretation

The regression model converts the cleaned loan records into predicted repayment probabilities. This makes the analysis more practical because the credit team can move from broad portfolio summaries to a customer-level risk ranking. Customers with lower predicted repayment probabilities can be flagged for review, closer monitoring, or additional affordability checks.

The coefficient table should be read as a guide to possible risk signals. A significant predictor does not mean the bank should approve or reject a customer based on that factor alone. It means the factor deserves attention in credit monitoring and further validation before any operational use.

13 10. Integrated Findings

The five analyses work together to build a practical view of portfolio risk. EDA first checks whether the data is usable and highlights issues such as missing values, skewed loan amounts, and repayment distribution. The visualisations then make the main patterns easier to see, especially default concentration by loan term, channel, and credit score band. Hypothesis testing adds statistical evidence to the patterns observed in the charts. Correlation analysis shows how numeric variables such as credit score, DPD, loan amount, and repayment outcome relate to one another. Finally, logistic regression brings the predictors together into one repayment probability model.

Together, the results support one main recommendation: Renmoney should use a structured risk segmentation approach that combines credit score bands, loan term performance, channel monitoring, and predicted repayment probability. This can support lending reviews and portfolio monitoring, while still remaining within existing credit policy, fairness checks, and proper model validation.

14 11. Limitations & Further Work

This analysis has a few limitations.

First, the dataset does not include interest rate or loan category, so the original pricing test could not be done directly. I used a substitute ANOVA to test whether loan amount differs across loan terms.

Second, removing rows with missing values reduced the sample size, although the final dataset still meets the requirement for this assessment.

Third, current DPD and maximum DPD may reflect behaviour after the loan was disbursed, rather than information available when the customer applied. This means they should be used carefully in any model intended for application-stage decisions.

Finally, the model was trained and evaluated on the same cleaned dataset, so the results should not be treated as final operational model performance.

With more data and time, I would request a fuller extract that includes application date, disbursement date, interest rate, loan category, approval status, repayment history, and more detailed affordability variables. I would also split the data into training and test sets, validate the model on out-of-time data, compare logistic regression with other models, and run fairness checks before recommending operational use.

15 References

The package references below were prepared using R’s citation("pkgname") function and then written in APA-style format for readability.

R Core Team. (2026). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Renmoney Microfinance Bank. (2026). Anonymised internal loan application and repayment dataset. Internal organisational dataset shared by Denis Burakov, Data Science Team Lead, for academic analysis. Raw dataset withheld due to confidentiality.

Course textbook. (n.d.). Data analytics textbook: Chapters 4, 5, 6, 8, 9, and 13. Course material.

Firke, S. (2023). janitor: Simple tools for examining and cleaning dirty data [R package]. The Comprehensive R Archive Network.

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage.

Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26.

Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139.

Pedersen, T. L. (2024). patchwork: The composer of plots [R package]. The Comprehensive R Archive Network.

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J. C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.

Robinson, D., Hayes, A., & Couch, S. (2023). broom: Convert statistical objects into tidy tibbles [R package]. The Comprehensive R Archive Network.

Spinu, V., Grolemund, G., & Wickham, H. (2023). lubridate: Make dealing with dates a little easier [R package]. The Comprehensive R Archive Network.

Waring, E., Quinn, M., McNamara, A., Arino de la Rubia, E., Zhu, H., & Ellis, S. (2022). skimr: Compact and flexible summaries of data [R package]. The Comprehensive R Archive Network.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.

Wickham, H., & Henry, L. (2023). forcats: Tools for working with categorical variables (factors) [R package]. The Comprehensive R Archive Network.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

Wilke, C. O. (2024). ggcorrplot: Visualization of a correlation matrix using ggplot2 [R package]. The Comprehensive R Archive Network.

Xie, Y. (2015). Dynamic documents with R and knitr (2nd ed.). Chapman and Hall/CRC.

16 Appendix A: Data Source Confirmation

The screenshot below confirms that the anonymised loan portfolio dataset was obtained from the organisation’s Data Analyst. It also confirms that is_bad is the outcome variable, that the dataset is a random sample of 1,000 digital applications between 2024-09-04 and 2026-03-22, and that the data does not contain sensitive customer information.

Screenshot confirming the internal data source, outcome variable, sample period, and data confidentiality note

17 Appendix B: AI Usage Statement

AI tools were used to support R code structuring and Quarto formatting. I reviewed the outputs, selected the analytical methods, interpreted the results, and made the final business recommendations myself. The dataset came from my organisation and was not generated by AI.

18 Appendix C: Exported Outputs

The code below exports the final analysis outputs for review. These files are generated from the anonymised dataset and do not include any personally identifiable information.

Code
write_csv(loan_clean_no_empty, "loan_analysis_final.csv")
write_csv(portfolio_summary, "portfolio_summary.csv")
write_csv(default_by_term, "default_by_loan_term.csv")
write_csv(default_by_channel, "default_by_application_channel.csv")
write_csv(default_by_gender, "default_by_gender.csv")
write_csv(default_by_score_band, "default_by_credit_score_band.csv")
write_csv(risk_ranking, "risk_ranking.csv")