Predictive Modelling with Logistic Regression on the German Credit Dataset
Author
Gousia Ain
Published
March 1, 2026
1 Motivation:
Banks face a fundamental challenge: how to maximize loan profits while minimizing default risk. Every loan approved to a “bad” customer results in losses, while every loan denied to a “good” customer means lost revenue.
This asymmetry is what makes credit risk modelling both analytically challenging and genuinely consequential. It is not simply a classification problem; it is a question of how financial opportunity is allocated, and who bears the cost when the decision is wrong. A predictive model therefore helps reduce credit risk and improve decision-making.
What motivates me about this problem is that behind every data point is a real person. Building accurate and fair models means fewer defaults for lenders, but also fewer people incorrectly denied credit they deserve. That combination of rigorous analysis and real-world impact is what drew me to this project.
2 Data provenance:
The German Credit dataset was originally compiled by Prof. Hans Hofmann of the University of Hamburg and donated to the UCI Machine Learning Repository in 1994. It is publicly available under a Creative Commons Attribution 4.0 (CC BY 4.0) licence, permitting free use and adaptation with appropriate credit.
Hofmann, H. (1994). Statlog (German Credit Data) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77
An important note on cost asymmetry: the dataset’s original documentation specifies a cost matrix in which misclassifying a bad customer as good carries a penalty five times greater than the reverse error. This asymmetry underpins the business case for prioritising specificity alongside overall accuracy in model evaluation.
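To make the 5:1 cost matrix concrete, the sketch below totals the cost of a set of predictions. The function name `misclassification_cost` and the toy label vectors are hypothetical and only illustrate the arithmetic; they are not output from the model in this report.

```r
# Cost matrix from the dataset documentation: a bad applicant predicted as
# good costs 5, a good applicant predicted as bad costs 1.
misclassification_cost <- function(actual, predicted,
                                   cost_bad_as_good = 5,
                                   cost_good_as_bad = 1) {
  bad_as_good <- sum(actual == "bad" & predicted == "good")
  good_as_bad <- sum(actual == "good" & predicted == "bad")
  bad_as_good * cost_bad_as_good + good_as_bad * cost_good_as_bad
}

# Hypothetical labels for illustration only (not model output)
actual    <- c("good", "good", "good", "bad",  "bad", "good", "bad",  "good")
predicted <- c("good", "bad",  "good", "good", "bad", "good", "good", "good")

total_cost <- misclassification_cost(actual, predicted)
total_cost
```

Two missed defaulters outweigh many wrongly declined good applicants under this weighting, which is why accuracy alone is a weak summary for this problem.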
3 Data description:
The dataset contains 1,000 loan applications described by 20 predictor variables (a mix of categorical and integer features) and one binary target variable classifying each applicant as a good or bad credit risk. No missing values are present in the data.
# Load required libraries and the raw data file
library(tidyverse)
library(knitr)
german_data <- read.table("german.data", header = FALSE)

# Save a CSV copy of the raw data (written before renaming the columns)
write.csv(german_data, "german_credit.csv", row.names = FALSE)

# Assign meaningful column names
colnames(german_data) <- c(
  "checking_status", "duration", "credit_history", "purpose",
  "credit_amount", "savings", "employment", "installment_rate",
  "personal_status_sex", "other_debtors", "residence_since",
  "property", "age", "other_installment_plans", "housing",
  "num_credits", "job", "num_dependents", "telephone",
  "foreign_worker", "class"
)

# Display variable names and their meanings
variable_descriptions <- data.frame(
  Variable = colnames(german_data),
  Description = c(
    "Status of existing checking account",
    "Loan duration in months",
    "Credit history",
    "Purpose of loan",
    "Credit amount in Deutsche Marks",
    "Savings account/bonds",
    "Employment duration",
    "Installment rate (% of disposable income)",
    "Personal status & sex",
    "Other debtors/guarantors",
    "Residence since (years)",
    "Property",
    "Age in years",
    "Other installment plans",
    "Housing situation",
    "Number of existing credits",
    "Job type",
    "Number of dependents",
    "Telephone ownership",
    "Foreign worker status",
    "Credit class (good/bad)"
  )
)

kable(variable_descriptions, caption = "Variable Descriptions")
Variable Descriptions

| Variable | Description |
|---|---|
| checking_status | Status of existing checking account |
| duration | Loan duration in months |
| credit_history | Credit history |
| purpose | Purpose of loan |
| credit_amount | Credit amount in Deutsche Marks |
| savings | Savings account/bonds |
| employment | Employment duration |
| installment_rate | Installment rate (% of disposable income) |
| personal_status_sex | Personal status & sex |
| other_debtors | Other debtors/guarantors |
| residence_since | Residence since (years) |
| property | Property |
| age | Age in years |
| other_installment_plans | Other installment plans |
| housing | Housing situation |
| num_credits | Number of existing credits |
| job | Job type |
| num_dependents | Number of dependents |
| telephone | Telephone ownership |
| foreign_worker | Foreign worker status |
| class | Credit class (good/bad) |
4 Data cleaning:
Before modelling, the default column names were replaced with meaningful ones, and all categorical variables, whose levels are stored as codes such as A11 and A14, were converted to factors. The target, stored as 1/2, was recoded as a factor with levels good and bad. Without this step, R would treat the target as a number rather than a class label, and the categorical predictors would have no explicitly defined levels for dummy-variable encoding.
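The cleaning step can be sketched on a toy data frame. The three columns and their values below are made up for illustration; the real data has 21 columns, but the same two calls to `factor()` apply.

```r
# Toy rows mimicking the raw file layout (values are illustrative only)
toy <- data.frame(
  checking_status = c("A11", "A14", "A12", "A14"),
  duration        = c(6, 48, 12, 24),
  class           = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Convert the coded categorical column to a factor
toy$checking_status <- factor(toy$checking_status)

# Recode the numeric target so that "good" is the first (reference) level
toy$class <- factor(toy$class, levels = c(1, 2), labels = c("good", "bad"))

str(toy)
levels(toy$class)
```

The order of the target's levels matters later: it determines which outcome the logistic regression models.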
5 EDA: Exploratory Data Analysis
5.1 Distribution of credit classes
The bar plot shows class imbalance, with 70% of observations classified as good credit and 30% as bad credit. This uneven distribution may bias the model toward the majority class and affects the modelling strategy. It is therefore worth considering techniques such as resampling so that the model can learn effectively from both classes.
5.2 Summary statistics by class
Table: Descriptive Statistics by Credit Class (Mean ± SD)

| Class | Loan Duration (months) | Credit Amount (DM) | Age (years) | Installment Rate (%) |
|---|---|---|---|---|
| good | 19.2 ± 11.1 | 2985 ± 2401 | 36.2 ± 11.4 | 2.92 ± 1.13 |
| bad | 24.9 ± 13.3 | 3938 ± 3536 | 34.0 ± 11.2 | 3.10 ± 1.09 |
Loan Duration: Customers with bad credit take loans for about 24.9 months on average, compared to 19.2 months for good credit customers. This difference of roughly 5.7 months suggests longer loan terms may be associated with higher default risk.
Credit Amount: The average borrowed amount for bad credit customers is 3938 DM, while for good credit customers it is 2985 DM. Risky customers therefore borrow approximately 950 DM more on average, indicating that larger loans are associated with a higher probability of default. The higher standard deviation (3536 vs 2401) also shows more variability among risky borrowers.
Age: Good credit customers are slightly older on average (36.2 years) than bad credit customers (34.0 years). The difference is only about 2 years and the standard deviations are similar, so age is unlikely to be a strong discriminator on its own.
Installment Rate: The average installment rate is 3.10 for bad credit and 2.92 for good credit, a very small difference. The variable takes integer values from 1 to 4 in this dataset, so these means are not literal percentages. Given the similar spread in both groups, the raw means suggest limited separation on this variable.
Overall Numerical Insight:
The most noticeable differences appear in loan duration and credit amount, where bad credit customers borrow more and repay over longer periods. Age and installment rate show only small mean differences, suggesting weaker influence on credit classification.
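One way to put the four variables on a common scale is a standardised mean difference (difference in means divided by the pooled standard deviation). The sketch below uses only the means and SDs reported in the table and the 700/300 class counts; the choice of this particular effect-size measure is an assumption, not part of the original analysis.

```r
# Means and SDs copied from the descriptive statistics table
stats <- data.frame(
  variable  = c("duration", "credit_amount", "age", "installment_rate"),
  mean_good = c(19.2, 2985, 36.2, 2.92),
  sd_good   = c(11.1, 2401, 11.4, 1.13),
  mean_bad  = c(24.9, 3938, 34.0, 3.10),
  sd_bad    = c(13.3, 3536, 11.2, 1.09)
)
n_good <- 700
n_bad  <- 300

# Pooled standard deviation and standardised mean difference (bad minus good)
pooled_sd <- sqrt(((n_good - 1) * stats$sd_good^2 + (n_bad - 1) * stats$sd_bad^2) /
                    (n_good + n_bad - 2))
stats$smd <- (stats$mean_bad - stats$mean_good) / pooled_sd

# Largest absolute difference first
stats[order(-abs(stats$smd)), c("variable", "smd")]
```

On this scale, loan duration shows the largest gap between the classes, followed by credit amount, which matches the reading of the raw means above.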
#| message: false
#| warning: false
#| results: hide

# Load libraries in correct order (MASS last)
library(tidyverse)
library(MASS)

# Create long format
numeric_vars_long <- german_data %>%
  dplyr::select(duration, credit_amount, age, installment_rate, class) %>%
  tidyr::pivot_longer(
    cols = c(duration, credit_amount, age, installment_rate),
    names_to = "variable",
    values_to = "value"
  )

# Plot one boxplot panel per numeric variable, split by credit class
ggplot(numeric_vars_long, aes(x = class, y = value, fill = class)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(
    title = "Distribution of Numerical Variables by Credit Class",
    x = "Credit Class", y = "Value"
  ) +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral"))
Loan Size: It can be observed that customers classified as bad credit generally take higher loan amounts. The variability is also greater in this group, suggesting that larger borrowed amounts may be associated with increased repayment risk.
Repayment Period: The bad credit group tends to have longer repayment durations compared to the good credit group. This indicates that longer loan periods might contribute to a higher probability of default.
Installment Proportion: Both credit groups show very similar distributions for this variable. Therefore, it does not seem to play a major role in distinguishing between good and bad credit customers in this dataset.
5.3 Checking account status vs Credit risk
# Checking status vs credit risk
ggplot(german_data, aes(x = checking_status, fill = class)) +
  geom_bar(position = "fill") +
  labs(
    subtitle = "A11: <0 DM | A12: 0-200 DM | A13: >200 DM | A14: No account",
    x = "Checking Account Status", y = "Proportion"
  ) +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral")) +
  coord_flip()
Customers with very low balances (<0 DM) show the highest proportion of bad credit, indicating a strong association between poor account status and higher default risk. As the account balance improves (especially >200 DM), the proportion of good credit increases markedly, suggesting checking account status is a strong predictor of credit risk.
5.4 Credit history vs Credit risk
# Credit history analysis
ggplot(german_data, aes(x = credit_history, fill = class)) +
  geom_bar(position = "fill") +
  labs(
    title = "Credit History vs Credit Risk",
    x = "Credit History Category", y = "Proportion"
  ) +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral"))
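Borrowers with a critical credit history (A34) have the lowest default rate, while those in categories A30 (no credits taken, or all credits paid back duly) and A31 (all credits at this bank paid back duly) show the highest proportion of bad credit outcomes. This ordering is counter-intuitive and is worth keeping in mind when reading the model coefficients.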
5.5 Loan purpose vs Credit risk
# Purpose of loan
ggplot(german_data, aes(x = purpose, fill = class)) +
  geom_bar(position = "fill") +
  labs(x = "Purpose", y = "Proportion") +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Used car loans (A41) and retraining (A48) have the lowest default rates, while education (A46) and repairs (A45) show the highest proportion of bad credit outcomes, suggesting loan purpose is a meaningful signal of repayment risk.
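The proportions read off the stacked bar charts can also be computed directly. The helper below, `bad_rate_by_level`, is a hypothetical name, and the toy data is invented so that the snippet runs on its own; with the real data the call would pass a column such as `german_data$purpose` and `german_data$class`.

```r
# Share of "bad" outcomes within each level of a categorical predictor
bad_rate_by_level <- function(x, class) {
  rates <- vapply(split(class == "bad", x), mean, numeric(1))
  sort(rates, decreasing = TRUE)
}

# Hypothetical toy data for illustration (not the real dataset)
toy <- data.frame(
  purpose = c("A40", "A40", "A41", "A41", "A41", "A43", "A43", "A43"),
  class   = c("bad", "good", "good", "good", "bad", "good", "bad", "bad")
)

rates <- bad_rate_by_level(toy$purpose, toy$class)
rates
```

Levels with very few observations produce unstable rates, so it helps to check the level counts with `table()` alongside the proportions.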
6 Logistic Regression:
Logistic regression was chosen as it is well-suited for binary classification problems, interpretable, and widely used in credit risk modelling, where understanding the direction and magnitude of each predictor’s effect is important.
Model assumption: Logistic regression assumes a linear relationship between predictors and the log-odds of the outcome. Diagnostic plots are examined below to check whether this assumption is reasonable.
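Before reading any coefficients, it helps to confirm which outcome the log-odds refer to. With a factor response, R's `glm()` treats the first level as failure and models the probability of the other level, so with levels `good`, `bad` the fitted probabilities are P(bad). The sketch below checks this on simulated data; the variable `x` and the chosen coefficients are assumptions for illustration only.

```r
set.seed(1)

# Simulated data: larger x makes the "bad" outcome more likely
n <- 500
x <- rnorm(n)
p_bad <- plogis(-0.5 + 1.2 * x)
class <- factor(ifelse(runif(n) < p_bad, "bad", "good"),
                levels = c("good", "bad"))

fit <- glm(class ~ x, family = binomial)

# With a factor response, glm() treats the first level as failure and
# models the probability of the remaining level, here "bad".
coef(fit)["x"]

# Fitted values are therefore P(class == "bad")
p_hat <- predict(fit, type = "response")
tapply(p_hat, class, mean)
```

A positive slope here means higher odds of bad credit, so coefficient signs in the models that follow should be read the same way.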
6.1 Full model:
# Full model with all predictors
model_full <- glm(
  class ~ .,
  data = german_data,
  family = binomial
)

# Diagnostic plots - Full Model
par(mfrow = c(2, 2))
plot(model_full, main = "Full Model Diagnostics")

# Reset plot layout
par(mfrow = c(1, 1))
Regression diagnostics were examined for the full model.
- The residuals vs fitted and scale-location plots display the characteristic butterfly pattern expected in logistic regression due to the binary outcome structure; this is not a violation of model assumptions.
- The Q-Q plot shows reasonable conformity along the diagonal with minor deviation in the upper tail, suggesting a small number of poorly fitted observations.
- The residuals vs leverage plot identifies observations 204, 736, and 819 as having relatively high leverage, warranting further investigation as potential influential points.

Overall, no assumption violations were detected that would fundamentally undermine the model’s validity.
6.2 Stepwise model:
# Stepwise model selection
library(MASS)

model_aic <- stepAIC(
  model_full,
  direction = "backward",
  trace = FALSE
)

# Diagnostic plots - Stepwise Model
par(mfrow = c(2, 2))
plot(model_aic, main = "AIC Model Diagnostics")

# Reset plot layout
par(mfrow = c(1, 1))
Diagnostic plots for the AIC stepwise model closely mirror those of the full model, confirming that variable reduction did not introduce new assumption violations. The residuals vs fitted and scale-location plots display the expected logistic regression structure in both cases. Notably, the leverage plot reveals that observation 204 remains influential across both models — suggesting it represents a genuinely unusual case in the data — while observation 736, flagged in the full model, no longer appears after stepwise selection, indicating its leverage was tied to a removed predictor. Observation 158 emerges as a new leverage point in the AIC model and warrants further inspection alongside observation 204.
The stepwise AIC model removes six predictors (employment, residence_since, property, num_credits, job, and num_dependents) that did not contribute meaningfully to model fit. The result is a leaner model with lower AIC (982.5 vs 993.8) and only slightly higher residual deviance (910.5 vs 895.8), indicating that the removed variables added complexity without materially improving fit. The remaining coefficients are largely consistent in direction and magnitude across both models, suggesting the stepwise selection was stable and the full model was not substantially distorted by the presence of irrelevant predictors. All further interpretation is based on the AIC-selected model, chosen for its parsimony while maintaining goodness of fit, which should reduce overfitting and help generalisation to unseen data.
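The comparison between a full and a stepwise-reduced model can be sketched on simulated data. Everything below (the data frame `d`, the predictors `x1`, `x2`, `noise1` to `noise3`) is invented for illustration; only `glm()`, `MASS::stepAIC()`, `AIC()` and `anova()` are the same calls one would apply to `model_full` and `model_aic`.

```r
library(MASS)  # provides stepAIC()

set.seed(42)

# Simulated data: x1 and x2 drive the outcome, noise1..noise3 do not
n <- 400
d <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n),
  noise1 = rnorm(n), noise2 = rnorm(n), noise3 = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(-0.3 + 1.0 * d$x1 - 0.8 * d$x2))

full    <- glm(y ~ ., data = d, family = binomial)
reduced <- stepAIC(full, direction = "backward", trace = FALSE)

# AIC side by side, plus a likelihood-ratio test of the dropped terms
AIC(full, reduced)
anova(reduced, full, test = "Chisq")
```

A non-significant likelihood-ratio test supports the reading that the dropped terms were not contributing, which is the same argument made from the AIC and deviance figures above.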
7 Model result and interpretation:
7.1 AIC- stepwise model coefficient plot
# Extract coefficients from AIC model
coef_df <- as.data.frame(summary(model_aic)$coefficients)
coef_df$variable <- rownames(coef_df)
coef_df <- coef_df[coef_df$variable != "(Intercept)", ]
coef_df <- coef_df[order(coef_df$Estimate), ]

# Keep only significant predictors (p < 0.05)
sig_coef <- coef_df[coef_df$`Pr(>|z|)` < 0.05, ]

# Plot coefficients with 95% Wald confidence intervals.
# class has levels good, bad, so glm() models the log-odds of "bad".
ggplot(sig_coef, aes(x = reorder(variable, Estimate), y = Estimate)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(
    aes(ymin = Estimate - 1.96 * `Std. Error`,
        ymax = Estimate + 1.96 * `Std. Error`),
    width = 0.2, color = "gray50"
  ) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Significant Predictors of Credit Risk",
    subtitle = "Positive values increase probability of 'bad' credit",
    x = "Predictor Variables",
    y = "Coefficient Estimate (Log-odds)"
  )
Because class is a factor with levels good and bad, glm() models the log-odds of the second level, bad. Positive coefficients therefore raise the probability of bad credit. Higher installment rates, longer loan durations, and larger credit amounts all have positive coefficients, which agrees with the EDA: riskier borrowers tend to take larger loans over longer periods. The strongest effects are negative, meaning lower odds of default relative to each factor's reference level: no checking account (A14, versus an overdrawn account, A11), used car purchase (A41, versus a new car, A40), critical credit history (A34, versus A30), not being a foreign worker (A202), and savings of at least 1,000 DM or unknown savings (A64 and A65, versus under 100 DM).
7.2 Odds-Ratio plot
# Convert to odds ratios
sig_coef$OR <- exp(sig_coef$Estimate)
sig_coef$CI_lower <- exp(sig_coef$Estimate - 1.96 * sig_coef$`Std. Error`)
sig_coef$CI_upper <- exp(sig_coef$Estimate + 1.96 * sig_coef$`Std. Error`)

# Plot odds ratios (odds of "bad" credit, the second level of class)
ggplot(sig_coef, aes(x = reorder(variable, OR), y = OR)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(
    aes(ymin = CI_lower, ymax = CI_upper),
    width = 0.2, color = "gray50"
  ) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Odds Ratios: How Each Factor Affects Credit Risk",
    subtitle = "Values >1 increase bad credit probability | Values <1 decrease it",
    x = "Predictor Variables",
    y = "Odds Ratio (with 95% CI)"
  )
Odds ratios above 1 raise the odds of bad credit and values below 1 lower them. Only installment_rate is clearly above 1 (OR ≈ 1.4 per unit), while duration and credit_amount sit near 1.0 because their effects are expressed per month and per DM. The strongest and most reliable protective effects are checking_statusA14 (OR ≈ 0.18), purposeA41 (OR ≈ 0.19), and credit_historyA34 (OR ≈ 0.24), all with narrow confidence intervals. The remaining predictors fall between roughly 0.25 and 0.5, representing moderately lower odds of default than their reference levels, though foreign_workerA202 and savingsA64 carry wide intervals and should be interpreted cautiously.
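An odds ratio is easier to judge once it is turned into a change in probability. The sketch below applies two of the reported odds ratios to a hypothetical baseline; the 30% baseline and the helper name `apply_odds_ratio` are assumptions for illustration. The probability refers to the outcome that `glm()` models, which for a factor with levels `good`, `bad` is the second level, `bad`.

```r
# Convert a baseline probability and an odds ratio into a new probability
apply_odds_ratio <- function(p_baseline, odds_ratio) {
  odds_new <- p_baseline / (1 - p_baseline) * odds_ratio
  odds_new / (1 + odds_new)
}

# Hypothetical baseline: a reference-group applicant with a 30% probability
p_ref <- 0.30

apply_odds_ratio(p_ref, 0.18)  # odds ratio reported for checking_statusA14
apply_odds_ratio(p_ref, 1.39)  # odds ratio reported for installment_rate (per unit)
```

The same odds ratio produces different probability shifts at different baselines, so these figures describe one illustrative applicant rather than an average effect.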
Generalized Linear Model
1000 samples
20 predictor
2 classes: 'good', 'bad'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results:
ROC Sens Spec
0.7774762 0.8714286 0.4766667
Five-fold cross-validation yielded an AUC of 0.778, confirming the model generalises reasonably well to unseen data. However, the results reveal a notable imbalance: sensitivity of 87% indicates the model is effective at identifying creditworthy customers, but specificity of only 48% means the majority of actual defaulters are misclassified as good.
This asymmetry stems from the class imbalance in the dataset (70% good, 30% bad) and has significant business implications — missed defaults represent direct financial losses. Future work should explore resampling techniques such as SMOTE or cost-sensitive learning to improve the model’s ability to detect bad credit customers without sacrificing too much sensitivity.
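Cost-sensitive learning can be approximated in a plain logistic regression with case weights. The sketch below is a minimal illustration on simulated data, with 5:1 weights chosen to mirror the dataset's cost matrix; it is not the analysis run in this report, and weighting in this way also changes the standard errors, so it suits prediction more than inference.

```r
set.seed(7)

# Simulated imbalanced data: roughly 30% "bad" outcomes
n <- 1000
x <- rnorm(n)
class <- factor(ifelse(runif(n) < plogis(-1.1 + 1.0 * x), "bad", "good"),
                levels = c("good", "bad"))
d <- data.frame(x = x, class = class)

# Case weights mirroring the 5:1 cost matrix: "bad" rows count five times
d$w <- ifelse(d$class == "bad", 5, 1)

fit_plain    <- glm(class ~ x, data = d, family = binomial)
fit_weighted <- glm(class ~ x, data = d, family = binomial, weights = w)

# Share of actual "bad" cases flagged as bad at a 0.5 cut-off
flag_rate <- function(fit) {
  p_bad <- predict(fit, newdata = d, type = "response")
  mean(p_bad[d$class == "bad"] > 0.5)
}
c(plain = flag_rate(fit_plain), weighted = flag_rate(fit_weighted))
```

Upweighting the minority class raises every predicted probability of default, so more defaulters are caught at the cost of more good applicants being flagged. Lowering the classification threshold has a similar effect without refitting.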
7.4 ROC full curve
library(pROC)

# Predictions
prob_aic  <- predict(model_aic, german_data, type = "response")
prob_full <- predict(model_full, german_data, type = "response")

# ROC objects
roc_aic <- roc(german_data$class, prob_aic,
               levels = c("good", "bad"), direction = "<")
roc_full <- roc(german_data$class, prob_full,
                levels = c("good", "bad"), direction = "<")

# Plot both ROC curves
plot(roc_full, col = "gray", lwd = 2, main = "ROC Curve Comparison")
plot(roc_aic, col = "steelblue", lwd = 2, add = TRUE)

# Add AUC values
legend(
  "bottomright",
  legend = c(paste("Full Model AUC:", round(auc(roc_full), 3)),
             paste("AIC Model AUC:", round(auc(roc_aic), 3))),
  col = c("gray", "steelblue"),
  lwd = 2
)
abline(a = 0, b = 1, lty = 2, col = "red")
The ROC curve comparison reveals that the AIC-selected stepwise model (AUC = 0.828) performs nearly identically to the full model (AUC = 0.834) despite using considerably fewer predictors — a difference of just 0.006 AUC that is negligible in practice. Visually, the two curves are almost indistinguishable. This supports selecting the stepwise model on grounds of parsimony and interpretability, both critical considerations in regulated credit risk environments. Notably, both in-sample AUCs exceed the cross-validated estimate of 0.778, indicating some degree of overfitting — the CV result should be treated as the more realistic performance benchmark for deployment.
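The gap between in-sample and cross-validated AUC can be reproduced on simulated data. The sketch below uses a rank-based AUC written in base R (equivalent to the Mann-Whitney statistic) rather than pROC or caret, and the data frame `d` with its noise predictors is invented for illustration.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic)
auc_rank <- function(score, is_positive) {
  r  <- rank(score)
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(123)

# Simulated data with pure-noise predictors to encourage overfitting
n <- 300
d <- data.frame(x1 = rnorm(n), matrix(rnorm(n * 10), n, 10))
d$y <- rbinom(n, 1, plogis(-0.8 + 0.9 * d$x1))

# In-sample AUC: fit and score on the same rows
fit_all <- glm(y ~ ., data = d, family = binomial)
auc_in  <- auc_rank(predict(fit_all, type = "response"), d$y == 1)

# 5-fold cross-validated AUC: score each fold with a model fitted on the rest
folds  <- sample(rep(1:5, length.out = n))
auc_cv <- mean(sapply(1:5, function(k) {
  fit_k <- glm(y ~ ., data = d[folds != k, ], family = binomial)
  p_k   <- predict(fit_k, newdata = d[folds == k, ], type = "response")
  auc_rank(p_k, d$y[folds == k] == 1)
}))

c(in_sample = auc_in, cross_validated = auc_cv)
```

The in-sample figure rewards the model for fitting noise in the rows it was trained on, which is why the cross-validated estimate is the safer benchmark.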
Using the optimal classification threshold identified from the ROC curve, the model achieves 75.3% accuracy — significantly above the no-information rate of 70% (p < 0.001). Crucially, the model demonstrates balanced performance with sensitivity and specificity both near 75%, a substantial improvement over the raw cross-validated specificity of 48%. Of the 300 actual bad credit customers, 225 are correctly identified, while 75 are misclassified as good — representing the highest-cost errors from a business perspective. The positive predictive value of 87.6% indicates that loan approvals from this model are highly reliable, though the negative predictive value of 56.7% suggests flagged rejections should be reviewed manually rather than automatically declined.
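The reported metrics can be checked from a confusion matrix. The 225 and 75 counts for bad customers come from the text above; the 528/172 split of the 700 good customers is back-calculated from the reported 75.3% accuracy and should be treated as an assumption, not as the original output.

```r
# Rows: actual class, columns: predicted class
cm <- matrix(c(528, 172,
                75, 225),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("good", "bad"),
                             predicted = c("good", "bad")))

# "good" is treated as the positive class, matching the reported metrics
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["good", "good"] / sum(cm["good", ])
specificity <- cm["bad", "bad"]   / sum(cm["bad", ])
ppv         <- cm["good", "good"] / sum(cm[, "good"])
npv         <- cm["bad", "bad"]   / sum(cm[, "bad"])

# Cost under the 5:1 matrix from the dataset documentation
cost <- 5 * cm["bad", "good"] + 1 * cm["good", "bad"]

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 3)
cost
```

Under the 5:1 weighting, the 75 missed defaulters account for most of the total cost even though they are fewer than the good customers flagged as bad.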
8 Business insights:
# Odds ratios refer to the odds of "bad" credit relative to each reference level
business_insights <- data.frame(
  Risk_Factor = c(
    "No checking account (A14)",
    "Used car purchase (A41)",
    "Critical credit history (A34)",
    "Unknown / no savings account (A65)",
    "Savings of 1,000 DM or more (A64)",
    "No other installment plans (A143)",
    "Not a foreign worker (A202)",
    "High installment rate",
    "Borderline model predictions"
  ),
  Model_Evidence = c(
    "Strongest effect: OR ≈ 0.18 vs overdrawn account (A11), p < 0.001",
    "Second strongest: OR ≈ 0.19 vs new car (A40), p < 0.001",
    "OR ≈ 0.24 vs A30, p = 0.001; reliable estimate",
    "OR ≈ 0.39 vs savings under 100 DM (A61), p < 0.001",
    "OR ≈ 0.27, p = 0.01; wide CI, treat cautiously",
    "OR ≈ 0.52, p = 0.007",
    "OR ≈ 0.25, p = 0.03; very wide CI",
    "OR ≈ 1.39 per unit, p < 0.001; only factor clearly above 1",
    "Model NPV = 56.7%; 'bad' predictions correct just over half the time"
  ),
  Impact = c(
    "Lowest default odds",
    "Much lower default odds",
    "Much lower default odds (counter-intuitive)",
    "Lower default odds",
    "Lower default odds, uncertain",
    "Moderately lower default odds",
    "Lower default odds; flag for fairness review",
    "Higher Risk",
    "Model Uncertainty"
  ),
  Business_Action = c(
    "Do not penalise the absence of a checking account; focus scrutiny on overdrawn accounts (A11), which show the highest bad-credit share",
    "Treat used-car loans as lower risk than new-car loans in this sample; the model supports no extra conditions",
    "Do not auto-reject; request a full credit report and review manually, because the direction is counter-intuitive",
    "Verify alternative assets (property, pension) rather than treating missing savings data as a negative signal",
    "Estimate is unstable; request savings documentation before relying on it",
    "Treat the absence of competing installment plans as a mild positive; review applicants with bank installment plans more closely",
    "Do not use in decisions without legal review; demographic factors must be handled in line with fair lending regulations",
    "Check affordability carefully when repayments take a large share of disposable income",
    "Do not auto-reject borderline cases; refer to a human underwriter, since automated rejection risks wrongly denying credit to 4 in 10 flagged customers"
  )
)

kable(
  business_insights,
  caption = "Business Recommendations Derived from Model Evidence",
  col.names = c("Risk Factor", "Model Evidence", "Impact", "Recommended Action")
)
Business Recommendations Derived from Model Evidence

| Risk Factor | Model Evidence | Impact | Recommended Action |
|---|---|---|---|
| No checking account (A14) | Strongest effect: OR ≈ 0.18 vs overdrawn account (A11), p < 0.001 | Lowest default odds | Do not penalise the absence of a checking account; focus scrutiny on overdrawn accounts (A11), which show the highest bad-credit share |
| Used car purchase (A41) | Second strongest: OR ≈ 0.19 vs new car (A40), p < 0.001 | Much lower default odds | Treat used-car loans as lower risk than new-car loans in this sample; the model supports no extra conditions |
| Critical credit history (A34) | OR ≈ 0.24 vs A30, p = 0.001; reliable estimate | Much lower default odds (counter-intuitive) | Do not auto-reject; request a full credit report and review manually, because the direction is counter-intuitive |
| Unknown / no savings account (A65) | OR ≈ 0.39 vs savings under 100 DM (A61), p < 0.001 | Lower default odds | Verify alternative assets (property, pension) rather than treating missing savings data as a negative signal |
| Savings of 1,000 DM or more (A64) | OR ≈ 0.27, p = 0.01; wide CI, treat cautiously | Lower default odds, uncertain | Estimate is unstable; request savings documentation before relying on it |
| No other installment plans (A143) | OR ≈ 0.52, p = 0.007 | Moderately lower default odds | Treat the absence of competing installment plans as a mild positive; review applicants with bank installment plans more closely |
| Not a foreign worker (A202) | OR ≈ 0.25, p = 0.03; very wide CI | Lower default odds; flag for fairness review | Do not use in decisions without legal review; demographic factors must be handled in line with fair lending regulations |
| High installment rate | OR ≈ 1.39 per unit, p < 0.001; only factor clearly above 1 | Higher Risk | Check affordability carefully when repayments take a large share of disposable income |
| Borderline model predictions | Model NPV = 56.7%; ‘bad’ predictions correct just over half the time | Model Uncertainty | Do not auto-reject borderline cases; refer to a human underwriter, since automated rejection risks wrongly denying credit to 4 in 10 flagged customers |
9 Limitations:
1. No holdout test set
The confusion matrix and ROC curve were evaluated on the same data the model was trained on, which means performance is likely slightly optimistic. The 5-fold cross-validated AUC of 0.778 is the more honest estimate of how the model would perform on new loan applications. In future work, I would set aside a dedicated test set before any model training to get a fully independent evaluation.
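A stratified holdout split can be sketched in base R. The `class` vector below is a stand-in with the same 700/300 composition as the dataset, and the 80/20 ratio is an assumption; with the real data the indices would be used to subset `german_data` before any model fitting.

```r
set.seed(2026)

# Stand-in for the class column: 700 good and 300 bad, as in the dataset
class <- factor(rep(c("good", "bad"), times = c(700, 300)),
                levels = c("good", "bad"))

# Sample 80% of the row indices within each class for training
train_idx <- unlist(lapply(split(seq_along(class), class), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))
test_idx <- setdiff(seq_along(class), train_idx)

table(class[train_idx])
table(class[test_idx])
```

Sampling within each class keeps the 70/30 ratio identical in both partitions, so test-set metrics are not distorted by a chance shift in class balance.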
2. Class imbalance
The dataset contains 700 good and 300 bad credit customers. This imbalance pushes the model toward predicting “good” by default, which is reflected in the low negative predictive value of 56.7%. Techniques like SMOTE or cost-sensitive learning could help the model better identify defaulters.
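3. Coefficient direction and reference levels
glm() models the probability of the second factor level, bad, so the positive coefficients for duration and credit amount indicate higher default odds, in line with the EDA. Effects for categorical predictors are relative to their reference levels, and some are counter-intuitive: critical credit history (A34) shows lower default odds than the reference category. Results like this may reflect how the sample was selected or correlation between predictors rather than a genuine protective effect, so they should not be read causally.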
4. Dataset age and context
The German Credit dataset originates from the 1970s and reflects a specific historical and economic context that may not generalise to modern lending. Spending patterns, financial products, and borrower demographics have changed substantially since then.
5. Ethical concern with demographic variables
Foreign worker status was statistically significant in the model, but using demographic characteristics in credit decisions raises serious fairness and legal concerns under modern equal credit opportunity regulations. In a real deployment, this variable would need to be reviewed carefully or removed entirely.
10 Conclusion:
This analysis set out to identify the key drivers of credit default risk using the German Credit dataset. The stepwise logistic regression model emerged as the preferred approach because its AUC is nearly identical to that of the full model (0.828 vs 0.834) while using fewer predictors, achieving a lower AIC, and offering greater interpretability.
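The strongest findings run against a simple reading in which a thin financial footprint means high risk. Because the model estimates the odds of bad credit, applicants with no checking account (A14) or unknown savings (A65) show markedly lower default odds than the reference groups with overdrawn accounts (A11) and savings under 100 DM (A61). The clearest risk signals are a visibly strained financial position, a high installment burden, and longer, larger loans.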
That said, the model has real limitations. The cross-validated AUC of 0.778 is the honest performance estimate, and the negative predictive value of 56.7% means the model should never be used to automatically reject applicants. It is a decision-support tool, not a decision-maker.
If I were to extend this work, I would implement a formal train/test split, explore SMOTE to address class imbalance, and test a gradient boosting model to capture non-linear interactions the logistic regression cannot. The business recommendations derived here provide a foundation — but responsible deployment would require further validation on more recent, representative data.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.
R Core Team. (2024). R: A Language and Environment for Statistical Computing (Version 4.4.2). R Foundation for Statistical Computing.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). https://doi.org/10.18637/jss.v028.i05
Robin, X., et al. (2011). pROC: An open-source package for R and S+. BMC Bioinformatics, 12, 77.
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.
Source Code
---title: "Credit Risk Analysis"subtitle: "Predictive Modelling with Logistic Regression on the German Credit Dataset"author: "Gousia Ain"date: todayformat: html: theme: flatly toc: true toc-depth: 3 toc-title: "📋 Contents" toc-location: left toc-expand: true number-sections: true code-fold: true code-summary: "Show Code" code-tools: true code-block-bg: true code-block-border-left: "#2c7bb6" highlight-style: github fig-width: 9 fig-height: 6 fig-align: center fig-cap-location: bottom df-print: kable smooth-scroll: true page-layout: full fontsize: 1.05em linestretch: 1.7 embed-resources: true css: styles.cssexecute: echo: true warning: false message: false cache: false---## Motivation:Banks face a fundamental challenge: **how to maximize loan profits while minimizing default risk**Every loan approved to a "bad" customer results in losses, while every loan denied to a "good" customer means lost revenue.This asymmetry is what makes credit risk modelling both analytically challenging and genuinely consequential. It is not simply a classification problem — it is a question of how financial opportunity is allocated, and who bears the cost when the decision is wrong. Therefore, building a predictive model helps minimize credit risk and improve decision-making.What motivates me about this problem is that behind every data point is a real person. Building accurate and fair models means fewer defaults for lenders, but also fewer people incorrectly denied credit they deserve. That combination of rigorous analysis and real-world impact is what drew me to this project.## Data provenance:The German Credit dataset was originally compiled by **Prof. Hans Hofmann** of the University of Hamburg and donated to the UCI Machine Learning Repository in 1994. 
It is publicly available under a **Creative Commons Attribution 4.0 (CC BY 4.0)** licence, permitting free use and adaptation with appropriate credit.An important note on cost asymmetry: the dataset's original documentation specifies a cost matrix in which misclassifying a bad customer as good carries a penalty **five times greater** than the reverse error. This asymmetry underpins the business case for prioritising specificity alongside overall accuracy in model evaluation.> Hofmann, H. (1994). *Statlog (German Credit Data)* [Dataset]. > UCI Machine Learning Repository. > https://doi.org/10.24432/C5NC77## Data description: The dataset contains **1,000 loan applications** described by 20 predictor variables — a mix of categorical and integer features — and one binary target variable classifying each applicant as a good or bad credit risk. No missing values are present in the data.```{r}#| message: false#| warning: false# Load required librarieslibrary(tidyverse)library(knitr)# Load the datagerman_data <-read.table("german.data", header =FALSE)# Display basic infocat("Dataset Shape:", dim(german_data)[1], "rows,", dim(german_data)[2], "columns\n")``````{r}write.csv(german_data, "german_credit.csv", row.names =FALSE)# Assign meaningful column namescolnames(german_data) <-c("checking_status", "duration", "credit_history", "purpose","credit_amount", "savings", "employment", "installment_rate","personal_status_sex", "other_debtors", "residence_since","property", "age", "other_installment_plans", "housing","num_credits", "job", "num_dependents", "telephone","foreign_worker", "class")# Display variable names and their meaningsvariable_descriptions <-data.frame(Variable =colnames(german_data),Description =c("Status of existing checking account","Loan duration in months","Credit history","Purpose of loan","Credit amount in Deutsche Marks","Savings account/bonds","Employment duration","Installment rate (% of disposable income)","Personal status & sex","Other 
debtors/guarantors","Residence since (years)","Property","Age in years","Other installment plans","Housing situation","Number of existing credits","Job type","Number of dependents","Telephone ownership","Foreign worker status","Credit class (good/bad)" ))kable(variable_descriptions, caption ="Variable Descriptions")```## Data cleaning:Before modelling, raw coded variables (e.g. `A11`, `A14`) were replaced with meaningful column names and all categorical variables were converted to factors. Without this step, R would treat category codes as numeric values, producing meaningless coefficient estimates and incorrect dummy variable encoding.```{r}# Convert categorical variables to factorscategorical_vars <-c("checking_status","credit_history","purpose","savings","employment","personal_status_sex","other_debtors","property","other_installment_plans","housing","job","telephone","foreign_worker")german_data[categorical_vars] <-lapply(german_data[categorical_vars], factor)# Recode target variablegerman_data$class <-factor(german_data$class, levels =c(1, 2),labels =c("good", "bad"))```## EDA : Exploratory Data Analysis### Distribution of credit classes```{r}# Class distributionclass_dist <-table(german_data$class)class_prop <-prop.table(class_dist) *100# Plotggplot(german_data, aes(x = class, fill = class)) +geom_bar() +geom_text(stat='count', aes(label=..count..), vjust=-0.5) +labs(title ="Distribution of Credit Classes",subtitle =paste0("Good: ", round(class_prop[1], 1), "% | Bad: ", round(class_prop[2], 1), "%"),x ="Credit Class", y ="Count") +theme_minimal() +scale_fill_manual(values =c("good"="steelblue", "bad"="coral"))```The barplot shows class imbalance, with 70% of observations classified as good credit and 30% as bad credit. This uneven distribution may cause the model to become biased toward the majority class and may affect our modeling strategy. 
Therefore, it is important to consider using techniques such as resampling to ensure the model can effectively learn from both classes.### summary statistics by classes:```{r}#| message: false#| warning: false#| fig.height: 6#| fig.width: 10library(tidyverse)library(knitr)library(kableExtra)# Calculate summary statisticssummary_stats <- german_data %>%group_by(class) %>%summarise(avg_duration =mean(duration),sd_duration =sd(duration),avg_credit_amount =mean(credit_amount),sd_credit_amount =sd(credit_amount),avg_age =mean(age),sd_age =sd(age),avg_installment_rate =mean(installment_rate),sd_installment_rate =sd(installment_rate) )# Create formatted summary table with explicit dplyr::selectsummary_table <- summary_stats %>%mutate(`Loan Duration (months)`=sprintf("%.1f ± %.1f", avg_duration, sd_duration),`Credit Amount (DM)`=sprintf("%.0f ± %.0f", avg_credit_amount, sd_credit_amount),`Age (years)`=sprintf("%.1f ± %.1f", avg_age, sd_age),`Installment Rate (%)`=sprintf("%.2f ± %.2f", avg_installment_rate, sd_installment_rate) ) %>% dplyr::select(Class = class, `Loan Duration (months)`, `Credit Amount (DM)`,`Age (years)`, `Installment Rate (%)`)# Display with kableExtrasummary_table %>%kable(caption ="**Table: Descriptive Statistics by Credit Class (Mean ± SD)**",align =c("l", "c", "c", "c", "c"),linesep ="" ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"),full_width =FALSE,position ="center",font_size =13 ) %>%column_spec(1, bold =TRUE, background ="skyblue") %>%column_spec(2:5, width ="4cm") %>%add_header_above(c(" "=1, "Summary Statistics (Mean ± Standard Deviation)"=4)) ```**loan Duration:**Customers with bad credit take loans for about 24.9 months on average, compared to 19.2 months for good credit customers. 
This is roughly a 5.7-month longer repayment period, which suggests that longer loan terms may be associated with higher default risk.

**Credit Amount:** The average borrowed amount for bad credit customers is 3938 DM, while for good credit customers it is 2985 DM. Risky customers therefore borrow approximately 950 DM more on average, indicating that larger loans may increase the probability of default. The higher standard deviation (3536 vs 2401) also shows more variability among risky borrowers.

**Age:** Good credit customers are slightly older on average (36.2 years) than bad credit customers (34.0 years). The difference is only about 2 years and the standard deviations are similar, so age alone is unlikely to separate the classes well.

**Installment Rate:** The average installment rate is 3.10 for bad credit and 2.92 for good credit. The variable is recorded as an integer from 1 to 4 rather than a literal percentage, and the difference is very small. Given the similar spread in both groups, this variable likely has limited predictive power on its own.

**Overall Numerical Insight:** The most noticeable differences appear in loan duration and credit amount, where bad credit customers borrow more and repay over longer periods.
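To put the credit amount gap on a unit-free scale, the means and standard deviations quoted above can be turned into a standardised mean difference (Cohen's d). This is only a sketch from the rounded summary figures, assuming the 700 good / 300 bad split of the dataset; the variable names are illustrative and not part of the original analysis.

```r
# Summary figures quoted in the text (rounded), plus the 700/300 class split
n_good <- 700
n_bad <- 300
mean_good <- 2985   # mean credit amount, good credit (DM)
mean_bad <- 3938    # mean credit amount, bad credit (DM)
sd_good <- 2401
sd_bad <- 3536

# Pooled standard deviation across the two classes
pooled_sd <- sqrt(
  ((n_good - 1) * sd_good^2 + (n_bad - 1) * sd_bad^2) / (n_good + n_bad - 2)
)

# Standardised mean difference (bad minus good)
cohens_d <- (mean_bad - mean_good) / pooled_sd
round(cohens_d, 2)
```

A value of roughly 0.34 says the classes differ by about a third of a pooled standard deviation, so the two distributions overlap heavily and credit amount alone cannot separate them.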
Age and installment rate show only small mean differences, suggesting a weaker influence on credit classification.

```{r}
#| message: false
#| warning: false
#| results: hide
# Load libraries in correct order (MASS last)
library(tidyverse)
library(MASS)

# Create long format
numeric_vars_long <- german_data %>%
  dplyr::select(duration, credit_amount, age, installment_rate, class) %>%
  tidyr::pivot_longer(
    cols = c(duration, credit_amount, age, installment_rate),
    names_to = "variable",
    values_to = "value"
  )

# Plot
ggplot(numeric_vars_long, aes(x = class, y = value, fill = class)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(
    title = "Distribution of Numerical Variables by Credit Class",
    x = "Credit Class", y = "Value"
  ) +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral"))
```

**Loan Size:** Customers classified as bad credit generally take larger loans. The variability is also greater in this group, suggesting that larger borrowed amounts may be associated with increased repayment risk.

**Repayment Period:** The bad credit group tends to have longer repayment durations than the good credit group. This indicates that longer loan periods might contribute to a higher probability of default.

**Installment Proportion:** Both credit groups show very similar distributions for this variable.
Therefore, it does not seem to play a major role in distinguishing between good and bad credit customers in this dataset.

### Checking account status vs Credit risk

```{r}
# Checking status vs credit risk
ggplot(german_data, aes(x = checking_status, fill = class)) +
  geom_bar(position = "fill") +
  labs(
    subtitle = "A11: <0 DM | A12: 0-200 DM | A13: >200 DM | A14: No account",
    x = "Checking Account Status", y = "Proportion"
  ) +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral")) +
  coord_flip()
```

Customers with very low balances (<0 DM) show the highest proportion of bad credit, indicating a strong association between poor account status and higher default risk. As the account balance improves (especially >200 DM), the proportion of good credit increases markedly, suggesting checking account status is a strong predictor of credit risk.

### Credit history vs Credit risk

```{r}
# Credit history analysis
ggplot(german_data, aes(x = credit_history, fill = class)) +
  geom_bar(position = "fill") +
  labs(
    title = "Credit History vs Credit Risk",
    x = "Credit History Category", y = "Proportion"
  ) +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral"))
```

Borrowers with a critical credit history (A34) have the lowest default rate, while categories A30 (no credits taken, or all credits paid back duly) and A31 (all credits at this bank paid back duly) show the highest proportion of bad credit outcomes.

### Loan purpose vs Credit risk

```{r}
# Purpose of loan
ggplot(german_data, aes(x = purpose, fill = class)) +
  geom_bar(position = "fill") +
  labs(x = "Purpose", y = "Proportion") +
  scale_fill_manual(values = c("good" = "steelblue", "bad" = "coral")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Used car loans (A41) and retraining (A48) have the lowest default rates, while education (A46) and repairs (A45) show the highest proportion of bad credit outcomes, suggesting loan purpose is a meaningful signal of repayment risk.

## Logistic Regression:

Logistic regression was chosen as it is well-suited for binary classification
problems, interpretable, and widely used in credit risk modelling, where understanding the direction and magnitude of each predictor's effect is important.

**Model assumption:** Logistic regression assumes a linear relationship between predictors and the log-odds of the outcome. Diagnostic plots are examined below as a rough check on this assumption.

### Full model:

```{r}
# Full model with all predictors
# class has levels c("good", "bad"); glm() treats the first level as failure,
# so fitted probabilities and coefficients refer to the odds of "bad"
model_full <- glm(class ~ ., data = german_data, family = binomial)

# Diagnostic plots - Full Model
par(mfrow = c(2, 2))
plot(model_full, main = "Full Model Diagnostics")
par(mfrow = c(1, 1))
```

Regression diagnostics were examined for the full model.

- The residuals vs fitted and scale-location plots display the characteristic butterfly pattern expected in logistic regression due to the binary outcome structure; this is not a violation of model assumptions.
- The Q-Q plot shows reasonable conformity along the diagonal with minor deviation in the upper tail, suggesting a small number of poorly fitted observations.
- The residuals vs leverage plot identifies observations 204, 736, and 819 as having relatively high leverage, warranting further investigation as potential influential points.

Overall, no assumption violations were detected that would fundamentally undermine the model's validity.

### Stepwise model:

```{r}
# Stepwise model selection
library(MASS)
model_aic <- stepAIC(model_full, direction = "backward", trace = FALSE)

# Diagnostic plots - Stepwise Model
par(mfrow = c(2, 2))
plot(model_aic, main = "AIC Model Diagnostics")

# Reset plot layout
par(mfrow = c(1, 1))
```

Diagnostic plots for the AIC stepwise model closely mirror those of the full model, suggesting that variable reduction did not introduce new assumption violations. The residuals vs fitted and scale-location plots display the expected logistic regression structure in both cases.
Notably, the leverage plot reveals that observation 204 remains influential across both models, suggesting it represents a genuinely unusual case in the data, while observation 736, flagged in the full model, no longer appears after stepwise selection, which suggests its leverage was tied to a removed predictor. Observation 158 emerges as a new leverage point in the AIC model and warrants further inspection alongside observation 204.

### Compare models:

```{r}
# Compare models (length(coef()) counts estimated coefficients, including the intercept)
model_comparison <- data.frame(
  Model = c("Full Model", "Stepwise Model"),
  AIC = c(AIC(model_full), AIC(model_aic)),
  Coefficients = c(length(coef(model_full)), length(coef(model_aic)))
)
kable(model_comparison, caption = "Model Comparison")
```

The stepwise AIC model removes six predictors (`employment`, `residence_since`, `property`, `num_credits`, `job`, and `num_dependents`) that did not contribute meaningfully to model fit. The result is a leaner model with lower AIC (982.5 vs 993.8) and only slightly higher residual deviance (910.5 vs 895.8), suggesting that the removed variables added complexity without improving fit. The remaining coefficients are largely consistent in direction and magnitude across both models, suggesting the stepwise selection was stable and the full model was not substantially distorted by the presence of irrelevant predictors.
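The AIC and deviance figures above are enough to run an approximate likelihood-ratio test of the two nested models. The sketch below uses only the rounded values quoted in the text and assumes the response is ungrouped 0/1 data, so that AIC = deviance + 2k; the variable names are illustrative. Because the smaller model was itself chosen by stepwise search, the p-value should be read as descriptive rather than as a formal test.

```r
# Figures quoted in the model comparison (rounded)
aic_full <- 993.8
dev_full <- 895.8
aic_step <- 982.5
dev_step <- 910.5

# For a binomial GLM on 0/1 data, AIC = deviance + 2k,
# so the number of estimated coefficients k can be recovered
k_full <- round((aic_full - dev_full) / 2)
k_step <- round((aic_step - dev_step) / 2)

# Likelihood-ratio statistic: increase in deviance from dropping the predictors
lr_stat <- dev_step - dev_full
df_diff <- k_full - k_step

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```

A deviance increase of 14.7 on 13 degrees of freedom is well within what chance would produce, which agrees with the AIC comparison. With the fitted objects available, `anova(model_aic, model_full, test = "Chisq")` performs the same comparison directly.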
All further interpretation is based on the AIC-selected model, chosen for its parsimony while maintaining goodness of fit, reducing overfitting, and enhancing generalisability to unseen data.

## Model result and interpretation:

### AIC stepwise model coefficient plot

```{r}
# Extract coefficients from AIC model
coef_df <- as.data.frame(summary(model_aic)$coefficients)
coef_df$variable <- rownames(coef_df)
coef_df <- coef_df[coef_df$variable != "(Intercept)", ]
coef_df <- coef_df[order(coef_df$Estimate), ]

# Keep only significant predictors (p < 0.05)
sig_coef <- coef_df[coef_df$`Pr(>|z|)` < 0.05, ]

# Plot coefficients with confidence intervals
# The model estimates the log-odds of "bad" (the second level of class)
ggplot(sig_coef, aes(x = reorder(variable, Estimate), y = Estimate)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(
    aes(ymin = Estimate - 1.96 * `Std. Error`,
        ymax = Estimate + 1.96 * `Std. Error`),
    width = 0.2, color = "gray50"
  ) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Significant Predictors of Credit Risk",
    subtitle = "Positive values increase probability of 'bad' credit",
    x = "Predictor Variables",
    y = "Coefficient Estimate (Log-odds)"
  )
```

Because `class` has the levels `good` and `bad`, `glm()` models the log-odds of the second level, `bad`, so a positive coefficient means higher default risk. Higher installment rates, longer loan durations, and larger credit amounts all have positive coefficients, so they are associated with a higher probability of bad credit, which matches the EDA. The largest effects are negative: no checking account (A14), used car purchase (A41), critical credit history (A34), not being a foreign worker (A202), and savings of at least 1000 DM or an unknown/no savings account (A64, A65). Each lowers the odds of bad credit relative to the reference level of its variable (for example, A14 relative to A11, a checking balance below 0 DM), which is consistent with the low default rates seen for A34 and A41 in the EDA.

### Odds-Ratio plot

```{r}
# Convert to odds ratios
sig_coef$OR <- exp(sig_coef$Estimate)
sig_coef$CI_lower <- exp(sig_coef$Estimate - 1.96 * sig_coef$`Std. Error`)
sig_coef$CI_upper <- exp(sig_coef$Estimate + 1.96 * sig_coef$`Std. Error`)

# Plot odds ratios
ggplot(sig_coef, aes(x = reorder(variable, OR), y = OR)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(
    aes(ymin = CI_lower, ymax = CI_upper),
    width = 0.2, color = "gray50"
  ) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Odds Ratios: How Each Factor Affects Credit Risk",
    subtitle = "Values >1 increase odds of bad credit | Values <1 decrease them",
    x = "Predictor Variables",
    y = "Odds Ratio (with 95% CI)"
  )
```

`installment_rate` is the only predictor whose odds ratio sits clearly above 1 (OR ≈ 1.4): each one-step increase in the installment rate multiplies the odds of bad credit by about 1.4. `duration` and `credit_amount` sit near 1.0 because their odds ratios are expressed per month and per DM. The strongest and most reliable effects are `checking_statusA14` (OR ≈ 0.18), `purposeA41` (OR ≈ 0.19), and `credit_historyA34` (OR ≈ 0.24), all with narrow confidence intervals; each cuts the odds of bad credit by roughly 75 to 80% relative to its reference level. The remaining predictors cluster between 0.3 and 0.5, a moderate reduction in risk, though `foreign_workerA202` and `savingsA64` carry wide intervals and should be interpreted cautiously.

### CV model

```{r}
library(caret)
set.seed(123)

# twoClassSummary treats the first level ("good") as the event of interest
ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

cv_model <- train(
  class ~ .,
  data = german_data,
  method = "glm",
  family = binomial,
  trControl = ctrl,
  metric = "ROC"
)
print(cv_model)
```

Five-fold cross-validation of the full predictor set yielded an AUC of 0.778, suggesting the model generalises reasonably well to unseen data. However, the results reveal a notable imbalance at the default 0.5 threshold: sensitivity of 87% indicates the model is effective at identifying creditworthy customers, but specificity of only 48% means just over half of actual defaulters are misclassified as good. This asymmetry stems from the class imbalance in the dataset (70% good, 30% bad) and has significant business implications, because missed defaults represent direct financial losses.
Future work should explore resampling techniques such as SMOTE, or cost-sensitive learning, to improve the model's ability to detect bad credit customers without sacrificing too much sensitivity.

### ROC curve

```{r}
library(pROC)

# Predictions (in-sample probabilities of "bad")
prob_aic <- predict(model_aic, german_data, type = "response")
prob_full <- predict(model_full, german_data, type = "response")

# ROC objects
roc_aic <- roc(german_data$class, prob_aic, levels = c("good", "bad"), direction = "<")
roc_full <- roc(german_data$class, prob_full, levels = c("good", "bad"), direction = "<")

# Plot both ROC curves
# plot.roc() draws the chance diagonal itself; its x-axis is specificity running
# from 1 to 0, so an extra abline(a = 0, b = 1) would draw the wrong diagonal
plot(roc_full, col = "gray", lwd = 2, main = "ROC Curve Comparison")
plot(roc_aic, col = "steelblue", lwd = 2, add = TRUE)

# Add AUC values
legend(
  "bottomright",
  legend = c(paste("Full Model AUC:", round(auc(roc_full), 3)),
             paste("AIC Model AUC:", round(auc(roc_aic), 3))),
  col = c("gray", "steelblue"),
  lwd = 2
)
```

The ROC curve comparison reveals that the AIC-selected stepwise model (AUC = 0.828) performs nearly identically to the full model (AUC = 0.834) despite using considerably fewer predictors; the difference of just 0.006 AUC is negligible in practice. Visually, the two curves are almost indistinguishable. This supports selecting the stepwise model on grounds of parsimony and interpretability, both critical considerations in regulated credit risk environments.
Notably, both in-sample AUCs exceed the cross-validated estimate of 0.778, indicating some degree of overfitting; the CV result should be treated as the more realistic performance benchmark for deployment.

### Confusion matrix

```{r}
# Find optimal threshold (chosen on the training data, so results are in-sample)
optimal_coords <- coords(
  roc_aic, "best",
  best.method = "closest.topleft",
  ret = c("threshold", "specificity", "sensitivity")
)

# Predict using optimal threshold (prob_aic is the probability of "bad")
predictions <- ifelse(prob_aic > optimal_coords$threshold, "bad", "good")
predictions <- factor(predictions, levels = c("good", "bad"))

# Confusion matrix ("good" is the positive class, being the first level)
conf_matrix <- confusionMatrix(predictions, german_data$class)

# Extract key metrics
metrics <- data.frame(
  Metric = c("Accuracy", "Sensitivity (Good)", "Specificity (Bad)",
             "Precision", "F1 Score", "AUC"),
  Value = c(
    conf_matrix$overall["Accuracy"],
    conf_matrix$byClass["Sensitivity"],
    conf_matrix$byClass["Specificity"],
    conf_matrix$byClass["Precision"],
    conf_matrix$byClass["F1"],
    auc(roc_aic)
  )
)
kable(metrics, caption = "Model Performance Metrics", digits = 3)
```

Using the optimal classification threshold identified from the ROC curve, the model achieves **75.3% accuracy**, significantly above the no-information rate of 70% (p < 0.001). Crucially, the model demonstrates balanced performance with sensitivity and specificity both near 75%, a substantial improvement over the raw cross-validated specificity of 48%, although the two figures are not directly comparable because this one is measured in-sample. Of the 300 actual bad credit customers, 225 are correctly identified, while 75 are misclassified as good, which are the highest-cost errors from a business perspective.
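These counts can be weighed with the 5:1 cost matrix from the dataset documentation, where approving a bad customer costs five times as much as rejecting a good one. The sketch below rebuilds the confusion-matrix cells from the figures quoted in the text (1,000 applicants, 75.3% accuracy, 225 of 300 bad customers caught); the variable names are illustrative, and the result is an in-sample figure in arbitrary cost units.

```r
# Figures quoted in the text
n_total <- 1000
n_bad <- 300
n_good <- 700
accuracy <- 0.753
bad_caught <- 225

# Remaining cells of the confusion matrix
bad_missed <- n_bad - bad_caught                            # bad customers approved
good_approved <- round(accuracy * n_total) - bad_caught     # good customers approved
good_rejected <- n_good - good_approved                     # good customers rejected

# Cost matrix from the dataset documentation (relative units)
cost_bad_as_good <- 5
cost_good_as_bad <- 1

# Total cost of the model's decisions versus two trivial policies
model_cost <- cost_bad_as_good * bad_missed + cost_good_as_bad * good_rejected
approve_all_cost <- cost_bad_as_good * n_bad
reject_all_cost <- cost_good_as_bad * n_good

c(model = model_cost, approve_all = approve_all_cost, reject_all = reject_all_cost)
```

Under these assumptions the model's decisions cost 547 units, against 1,500 for approving everyone and 700 for rejecting everyone. The 75 missed defaulters still account for 375 of the 547 units, which is why specificity matters more than raw accuracy here.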
The positive predictive value of 87.6% indicates that loan approvals from this model are highly reliable, though the negative predictive value of 56.7% suggests flagged rejections should be reviewed manually rather than automatically declined.

## Business insights:

```{r}
# Odds ratios refer to the odds of "bad" credit, relative to each variable's reference level
business_insights <- data.frame(
  Risk_Factor = c(
    "No checking account (A14)",
    "Used car purchase (A41)",
    "Critical credit history (A34)",
    "Unknown / no savings account (A65)",
    "Savings of 1000 DM or more (A64)",
    "No other installment plans (A143)",
    "Not a foreign worker (A202)",
    "High installment rate",
    "Borderline model predictions"
  ),
  Model_Evidence = c(
    "Strongest effect: OR ≈ 0.18, p < 0.001",
    "Second strongest: OR ≈ 0.19, p < 0.001",
    "OR ≈ 0.24, p = 0.001; reliable estimate",
    "OR ≈ 0.39, p < 0.001",
    "OR ≈ 0.27, p = 0.01; wide CI, treat cautiously",
    "OR ≈ 0.52, p = 0.007",
    "OR ≈ 0.25, p = 0.03; very wide CI",
    "OR ≈ 1.39, p < 0.001; only odds ratio clearly above 1",
    "Model NPV = 56.7%; 'bad' predictions correct just over half the time"
  ),
  Impact = c(
    "Much lower risk than A11 (balance < 0 DM)",
    "Much lower risk than A40 (new car)",
    "Much lower risk than A30",
    "Lower risk than A61 (< 100 DM)",
    "Lower risk, uncertain",
    "Moderately lower risk",
    "Lower risk; flag for fairness review",
    "Higher risk",
    "Model Uncertainty"
  ),
  Business_Action = c(
    "Do not penalise the absence of a checking account; focus scrutiny on applicants with an overdrawn account (A11)",
    "No extra down payment is supported by the model; used car loans default less often than the new car baseline",
    "Do not auto-reject on the label alone; review manually, since this group defaults least often in the data",
    "Verify alternative assets (property, pension) rather than treating missing savings information as a red flag",
    "Estimate is unstable; request savings documentation before giving credit for it",
    "Treat the absence of other installment plans as a mild positive; review applicants with bank or store plans more closely",
    "Do not use in decisions without legal review; demographic factors require consistent application to comply with fair lending regulations",
    "Check affordability; a higher installment rate raises the odds of default",
    "Do not auto-reject borderline cases; refer to a human underwriter, since automated rejection risks wrongly denying credit to about 4 in 10 flagged customers"
  )
)

kable(
  business_insights,
  caption = "Business Recommendations Derived from Model Evidence",
  col.names = c("Risk Factor", "Model Evidence", "Impact", "Recommended Action")
)
```

## Limitations:

**1. No holdout test set**

The confusion matrix and ROC curve were evaluated on the same data the model was trained on, which means performance is likely optimistic. The 5-fold cross-validated AUC of 0.778 is the more honest estimate of how the model would perform on new loan applications. In future work, I would set aside a dedicated test set before any model training to get a fully independent evaluation.

**2. Class imbalance**

The dataset contains 700 good and 300 bad credit customers. This imbalance pushes the model toward predicting "good" by default, which is reflected in the low negative predictive value of 56.7%. Techniques like SMOTE or cost-sensitive learning could help the model better identify defaulters.

**3. Coefficient direction and reference levels**

`glm()` models the probability of the second factor level (`bad`), so the positive coefficients for duration and credit amount agree with the EDA, where bad customers tend to have longer loans and larger amounts. Effects of categorical predictors are relative to each variable's reference level, so they must be read as comparisons (for example A14 against A11) rather than as absolute risk, and correlated predictors can still shift individual estimates.

**4. Dataset age and context**

The German Credit dataset originates from the 1970s and reflects a specific historical and economic context that may not generalise to modern lending. Spending patterns, financial products, and borrower demographics have changed substantially since then.

**5.
Ethical concern with demographic variables**

Foreign worker status was statistically significant in the model, but using demographic characteristics in credit decisions raises serious fairness and legal concerns under modern equal credit opportunity regulations. In a real deployment, this variable would need to be reviewed carefully or removed entirely.

## Conclusion:

This analysis set out to identify the key drivers of credit default risk using the German Credit dataset. The stepwise logistic regression model emerged as the preferred approach because its AUC is nearly identical to that of the full model (0.828 vs 0.834) with fewer predictors, lower AIC, and greater interpretability. Because the model estimates the odds of bad credit, the clearest signals of higher default risk are a higher installment rate, a longer duration, and a larger credit amount, while applicants with no checking account (A14), a used car loan (A41), or a critical credit history (A34) have markedly lower odds of default than their reference groups.

That said, the model has real limitations. The cross-validated AUC of 0.778 is the honest performance estimate, and the negative predictive value of 56.7% means the model should never be used to automatically reject applicants. It is a decision-support tool, not a decision-maker.

If I were to extend this work, I would implement a formal train/test split, explore SMOTE to address class imbalance, and test a gradient boosting model to capture non-linear interactions the logistic regression cannot. The business recommendations derived here provide a foundation, but responsible deployment would require further validation on more recent, representative data.

## References:

Hofmann, H. (1994). *Statlog (German Credit Data)*. UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning with Applications in R* (2nd ed.). Springer.

R Core Team. (2024). *R: A Language and Environment for Statistical Computing* (Version 4.4.2).
R Foundation for Statistical Computing.

Wickham, H. (2016). *ggplot2: Elegant Graphics for Data Analysis*. Springer.

Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. *Journal of Statistical Software, 28*(5). https://doi.org/10.18637/jss.v028.i05

Robin, X., et al. (2011). pROC: An open-source package for R and S+. *BMC Bioinformatics, 12*, 77.

Venables, W. N., & Ripley, B. D. (2002). *Modern Applied Statistics with S* (4th ed.). Springer.