When assessing the risk of credit card default, financial institutions rely on predictive models to make informed decisions about lending. Logistic Regression is a traditional statistical method commonly used for binary classification problems, such as predicting whether a customer will default on a credit card payment. Its simplicity, interpretability, and relatively low computational cost make it a popular choice in the financial sector. Logistic Regression models the probability of default as a function of various customer characteristics, providing insights that are easy to understand and communicate.
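For reference, the standard form of the model can be written as below. The notation is generic and not taken from the original text: $x_1,\dots,x_k$ stand for the customer characteristics and $\beta_0,\dots,\beta_k$ for the coefficients to be estimated.

```latex
\[
P(\text{Target}=1 \mid x)
  = \frac{1}{1+\exp\{-(\beta_0+\beta_1 x_1+\dots+\beta_k x_k)\}},
\qquad
\log\frac{P(\text{Target}=1 \mid x)}{1-P(\text{Target}=1 \mid x)}
  = \beta_0+\beta_1 x_1+\dots+\beta_k x_k .
\]
```

The second expression is the log-odds (logit) form, which is the scale on which the coefficients are reported by `summary()`.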
Note that we cannot control which type of customer approaches the bank, so the explanatory (regression) aspect of the logistic model is of little use here: we cannot change the predictors to obtain a desired value of the response.
So we will use logistic regression for prediction and judge it by its predictive performance.
Dataset Link : Click Here
ID: A unique identifier for each customer.
Gender: The gender of the customer (e.g., Male, Female). Categorical variable.
Own_car: Indicates whether the customer owns a car (e.g., Yes, No). Binary categorical variable.
Own_property: Indicates whether the customer owns property (e.g., Yes, No). Binary categorical variable.
Work_phone: Indicates whether the customer has a work phone (e.g., Yes, No). Binary categorical variable.
Phone: Indicates whether the customer has a personal phone (e.g., Yes, No). Binary categorical variable.
Email: Indicates whether the customer has an email address (e.g., Yes, No). Binary categorical variable.
Unemployed: Indicates whether the customer is unemployed (e.g., Yes, No). Binary categorical variable.
Num_children: The number of children the customer has. Numeric variable.
Num_family: The number of family members the customer has. Numeric variable.
Account_length: The length of time the customer has held their account. Numeric variable (often in months or years).
Total_income: The total income of the customer. Numeric variable.
Age: The age of the customer. Numeric variable.
Years_employed: The number of years the customer has been employed. Numeric variable.
Income_type: The type of income the customer receives (e.g., Salary, Business, Pension). Categorical variable.
Education_type: The level of education the customer has attained (e.g., High School, Bachelor’s, Master’s). Categorical variable.
Family_status: The customer’s family status (e.g., Single, Married, Divorced). Categorical variable.
Housing_type: The type of housing the customer lives in (e.g., Owned, Rented). Categorical variable.
Occupation_type: The customer’s occupation (e.g., Professional, Clerical, Service). Categorical variable.
Target: The variable indicating the risk outcome (e.g., Defaulted, Not Defaulted). This is the dependent variable you are predicting.
Getting the required packages:
pacman::p_load(caret,ROCR,hnp,ggcorrplot)
Loading the data
data=read.csv("C:\\Users\\zeeda\\OneDrive\\Desktop\\dataset.csv")
data=data[,-1] #-- removing the ID column
sum(is.na(data)) #-- no missing values
[1] 0
Converting the categorical variables into factors
for (i in c(1:7,14:19)){
data[,i]=as.factor(data[,i])
}
str(data)
'data.frame': 9709 obs. of 19 variables:
$ Gender : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 1 1 2 ...
$ Own_car : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 1 1 2 ...
$ Own_property : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
$ Work_phone : Factor w/ 2 levels "0","1": 2 1 1 1 2 1 1 1 1 1 ...
$ Phone : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 2 1 1 ...
$ Email : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
$ Unemployed : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
$ Num_children : int 0 0 0 0 0 0 0 0 1 3 ...
$ Num_family : int 2 2 1 1 2 2 2 2 2 5 ...
$ Account_length : int 15 29 4 20 5 17 25 31 44 24 ...
$ Total_income : num 427500 112500 270000 283500 270000 ...
$ Age : num 32.9 58.8 52.3 61.5 46.2 ...
$ Years_employed : num 12.44 3.1 8.35 0 2.11 ...
$ Income_type : Factor w/ 5 levels "Commercial associate",..: 5 5 1 2 5 1 5 5 5 5 ...
$ Education_type : Factor w/ 5 levels "Academic degree",..: 2 5 5 2 2 5 3 5 5 5 ...
$ Family_status : Factor w/ 5 levels "Civil marriage",..: 1 2 4 3 2 2 2 2 4 2 ...
$ Housing_type : Factor w/ 6 levels "Co-op apartment",..: 5 2 2 2 2 2 2 2 2 2 ...
$ Occupation_type: Factor w/ 19 levels "Accountants",..: 13 18 16 13 1 9 1 9 13 9 ...
$ Target : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 2 1 1 ...
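The loop above selects columns by position, which silently breaks if the column order changes. A minimal sketch of the same conversion done by name is shown below; the toy data frame and the `factor_cols` vector are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  Gender      = c(0, 1, 1, 0),
  Own_car     = c(1, 0, 1, 0),
  Age         = c(32.9, 58.8, 52.3, 61.5),
  Income_type = c("Working", "Pensioner", "Working", "Student"),
  Target      = c(1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical, selected by name instead of position
factor_cols <- c("Gender", "Own_car", "Income_type", "Target")

# lapply() converts each selected column; the other columns are left untouched
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

On the real data, `factor_cols` would list the twelve categorical predictors plus `Target`.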
summary(data)
Gender Own_car Own_property Work_phone Phone Email Unemployed
0:6323 0:6139 0:3189 0:7598 0:6916 0:8859 0:8013
1:3386 1:3570 1:6520 1:2111 1:2793 1: 850 1:1696
Num_children Num_family Account_length Total_income
Min. : 0.0000 Min. : 1.000 Min. : 0.00 Min. : 27000
1st Qu.: 0.0000 1st Qu.: 2.000 1st Qu.:13.00 1st Qu.: 112500
Median : 0.0000 Median : 2.000 Median :26.00 Median : 157500
Mean : 0.4228 Mean : 2.183 Mean :27.27 Mean : 181228
3rd Qu.: 1.0000 3rd Qu.: 3.000 3rd Qu.:41.00 3rd Qu.: 225000
Max. :19.0000 Max. :20.000 Max. :60.00 Max. :1575000
Age Years_employed Income_type
Min. :20.50 Min. : 0.0000 Commercial associate:2312
1st Qu.:34.06 1st Qu.: 0.9282 Pensioner :1712
Median :42.74 Median : 3.7619 State servant : 722
Mean :43.78 Mean : 5.6647 Student : 3
3rd Qu.:53.57 3rd Qu.: 8.2000 Working :4960
Max. :68.86 Max. :43.0207
Education_type Family_status
Academic degree : 6 Civil marriage : 836
Higher education :2457 Married :6530
Incomplete higher : 371 Separated : 574
Lower secondary : 114 Single_unmarried:1359
Secondary_secondary special:6761 Widow : 410
Housing_type Occupation_type Target
Co-op apartment : 34 Other :2994 0:8426
House_apartment :8684 Laborers :1724 1:1283
Municipal apartment: 323 Sales staff: 959
Office apartment : 76 Core staff : 877
Rented apartment : 144 Managers : 782
With parents : 448 Drivers : 623
(Other) :1750
categorical_cols <- c('Gender', 'Own_car', 'Own_property', 'Work_phone', 'Phone', 'Email', 'Unemployed',
'Income_type', 'Education_type', 'Family_status', 'Housing_type', 'Occupation_type')
numeric_cols <- c('Num_children', 'Num_family', 'Account_length', 'Total_income', 'Age', 'Years_employed')
# Distribution of the target variable (High risk vs Low risk)
barplot(table(data$Target),
main = "Distribution of Target Variable",
xlab = "Risk (0 = Low, 1 = High)",
ylab = "Count",
col = "lightblue")
# Correlation between numeric variables
numeric_data <- data[, numeric_cols]
corr_matrix <- cor(numeric_data, use = "complete.obs")
# Visualize the correlation matrix
ggcorrplot(corr_matrix,method="square",type="lower",lab=TRUE)
# Plotting numeric variables using histograms
par(mfrow=c(3,2))
for (var in numeric_cols) {
hist(data[[var]],
main = paste("Distribution of", var),
xlab = var,
col = "lightgreen",
breaks = 20)
}
par(mfrow=c(3,2))
# Boxplots to check distribution of numeric variables by Target
for (var in numeric_cols) {
boxplot(data[[var]] ~ data$Target,
main = paste(var, "by Risk Group"),
xlab = "Risk Group (0 = Low, 1 = High)",
ylab = var,
col = "lightcoral")
}
par(mfrow=c(4,3))
# Plotting categorical variables using barplots
for (var in categorical_cols) {
barplot(table(data[[var]]),
main = paste("Distribution of", var),
xlab = var,
ylab = "Count",
col = "lightyellow")
}
par(mfrow=c(4,3))
# Cross-tabulation of categorical variables with the Target variable
for (var in categorical_cols) {
cat_table <- table(data[[var]], data$Target)
barplot(cat_table,
beside = TRUE,
main = paste(var, "by Risk Group"),
col = c("lightblue", "lightpink"),
legend = rownames(cat_table),
xlab = var, ylab = "Count")
}
Before constructing the model, we check whether the response variable Target is balanced.
table(data$Target)
0 1
8426 1283
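# The classes are imbalanced, so randomly undersample the majority class (Target = 0)
# Note: this sample() call runs before set.seed(42), so the retained subset is not reproducible
index_0=which(data$Target==0)
index=sample(index_0,(8426-1283),F) #-- rows of class 0 to drop
data=data[-index,]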
table(data$Target)
0 1
1283 1283
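Because the sampling step above runs before `set.seed(42)`, rerunning the script keeps a different subset of low-risk customers each time. A minimal sketch of a reproducible version follows, on made-up data; the names `toy`, `idx_major`, `idx_minor`, and `balanced` are hypothetical.

```r
set.seed(42)  # set the seed BEFORE sampling so the subsample can be reproduced

# Toy imbalanced target: 20 low-risk (0) and 5 high-risk (1) rows (made-up data)
toy <- data.frame(x = rnorm(25),
                  Target = factor(rep(c(0, 1), times = c(20, 5))))

idx_major <- which(toy$Target == 0)
idx_minor <- which(toy$Target == 1)

# Keep as many majority rows as there are minority rows
keep_major <- sample(idx_major, length(idx_minor), replace = FALSE)
balanced   <- toy[c(keep_major, idx_minor), ]

table(balanced$Target)
```

Deriving the sample size from the data (`length(idx_minor)`) also avoids hard-coding counts such as 8426 and 1283.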
set.seed(42)
size=floor(nrow(data)*0.8)
split_index=sample(1:nrow(data),size,F)
train_data=data[split_index,]
test_data=data[-split_index,]
Building the model
model=glm(Target~.,data=train_data,family="binomial")
summary(model)
##
## Call:
## glm(formula = Target ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.180e-01 1.680e+00 -0.189
## Gender1 1.179e-01 1.187e-01 0.993
## Own_car1 -1.633e-01 1.075e-01 -1.519
## Own_property1 -2.237e-01 1.014e-01 -2.206
## Work_phone1 9.347e-02 1.237e-01 0.756
## Phone1 -3.167e-02 1.069e-01 -0.296
## Email1 -2.344e-01 1.614e-01 -1.452
## Unemployed1 -1.445e+01 2.602e+02 -0.056
## Num_children -5.048e-01 3.562e-01 -1.417
## Num_family 5.649e-01 3.497e-01 1.615
## Account_length 1.430e-02 2.819e-03 5.073
## Total_income 7.221e-07 4.881e-07 1.479
## Age -1.178e-02 5.883e-03 -2.001
## Years_employed -1.211e-02 8.839e-03 -1.370
## Income_typePensioner 1.456e+01 2.602e+02 0.056
## Income_typeState servant 8.179e-02 2.069e-01 0.395
## Income_typeStudent 1.482e+01 8.827e+02 0.017
## Income_typeWorking 1.225e-01 1.133e-01 1.081
## Education_typeHigher education -1.144e+00 1.238e+00 -0.924
## Education_typeIncomplete higher -9.122e-01 1.255e+00 -0.727
## Education_typeLower secondary -6.931e-01 1.309e+00 -0.529
## Education_typeSecondary_secondary special -1.055e+00 1.236e+00 -0.854
## Family_statusMarried -1.862e-02 1.592e-01 -0.117
## Family_statusSeparated 6.409e-01 4.237e-01 1.513
## Family_statusSingle_unmarried 6.016e-01 3.758e-01 1.601
## Family_statusWidow 5.478e-01 4.513e-01 1.214
## Housing_typeHouse_apartment 2.819e-01 8.014e-01 0.352
## Housing_typeMunicipal apartment 4.022e-01 8.399e-01 0.479
## Housing_typeOffice apartment 2.759e-01 1.002e+00 0.275
## Housing_typeRented apartment 8.137e-01 8.954e-01 0.909
## Housing_typeWith parents 1.842e-01 8.296e-01 0.222
## Occupation_typeCleaning staff 4.170e-01 4.864e-01 0.857
## Occupation_typeCooking staff 2.565e-01 4.248e-01 0.604
## Occupation_typeCore staff 3.984e-01 3.072e-01 1.297
## Occupation_typeDrivers -9.525e-02 3.287e-01 -0.290
## Occupation_typeHigh skill tech staff 4.368e-01 3.560e-01 1.227
## Occupation_typeHR staff 8.720e-01 1.286e+00 0.678
## Occupation_typeIT staff 1.111e+00 1.283e+00 0.865
## Occupation_typeLaborers -4.516e-02 2.916e-01 -0.155
## Occupation_typeLow-skill Laborers -1.178e-01 5.575e-01 -0.211
## Occupation_typeManagers 1.666e-02 3.065e-01 0.054
## Occupation_typeMedicine staff 3.782e-01 3.779e-01 1.001
## Occupation_typeOther -8.573e-03 2.897e-01 -0.030
## Occupation_typePrivate service staff -1.998e-01 5.784e-01 -0.346
## Occupation_typeRealty agents -1.454e+01 5.868e+02 -0.025
## Occupation_typeSales staff -2.926e-01 3.021e-01 -0.969
## Occupation_typeSecretaries 1.215e-01 8.731e-01 0.139
## Occupation_typeSecurity staff 3.283e-01 4.119e-01 0.797
## Occupation_typeWaiters_barmen staff -3.480e-01 6.571e-01 -0.530
## Pr(>|z|)
## (Intercept) 0.8499
## Gender1 0.3206
## Own_car1 0.1288
## Own_property1 0.0274 *
## Work_phone1 0.4497
## Phone1 0.7671
## Email1 0.1464
## Unemployed1 0.9557
## Num_children 0.1565
## Num_family 0.1062
## Account_length 3.91e-07 ***
## Total_income 0.1390
## Age 0.0453 *
## Years_employed 0.1708
## Income_typePensioner 0.9554
## Income_typeState servant 0.6926
## Income_typeStudent 0.9866
## Income_typeWorking 0.2798
## Education_typeHigher education 0.3554
## Education_typeIncomplete higher 0.4672
## Education_typeLower secondary 0.5965
## Education_typeSecondary_secondary special 0.3933
## Family_statusMarried 0.9069
## Family_statusSeparated 0.1304
## Family_statusSingle_unmarried 0.1094
## Family_statusWidow 0.2248
## Housing_typeHouse_apartment 0.7251
## Housing_typeMunicipal apartment 0.6320
## Housing_typeOffice apartment 0.7830
## Housing_typeRented apartment 0.3635
## Housing_typeWith parents 0.8243
## Occupation_typeCleaning staff 0.3913
## Occupation_typeCooking staff 0.5460
## Occupation_typeCore staff 0.1947
## Occupation_typeDrivers 0.7720
## Occupation_typeHigh skill tech staff 0.2198
## Occupation_typeHR staff 0.4978
## Occupation_typeIT staff 0.3868
## Occupation_typeLaborers 0.8769
## Occupation_typeLow-skill Laborers 0.8327
## Occupation_typeManagers 0.9566
## Occupation_typeMedicine staff 0.3169
## Occupation_typeOther 0.9764
## Occupation_typePrivate service staff 0.7297
## Occupation_typeRealty agents 0.9802
## Occupation_typeSales staff 0.3327
## Occupation_typeSecretaries 0.8893
## Occupation_typeSecurity staff 0.4254
## Occupation_typeWaiters_barmen staff 0.5964
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2844.3 on 2051 degrees of freedom
## Residual deviance: 2745.4 on 2003 degrees of freedom
## AIC: 2843.4
##
## Number of Fisher Scoring iterations: 13
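The estimates above are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below does this for the three coefficients flagged as significant at the 5% level, with values copied from the output above; with the fitted object available, `exp(coef(model))` would give the full set.

```r
# Estimates copied from the summary(model) output above
est <- c(Own_property1 = -2.237e-01,
         Account_length = 1.430e-02,
         Age = -1.178e-02)

# Exponentiating a log-odds coefficient gives the multiplicative change in the odds
odds_ratio <- exp(est)
round(odds_ratio, 3)
```

Read this way, each additional unit of Account_length multiplies the estimated odds of Target = 1 by about 1.014, and owning property multiplies them by about 0.80, with the other predictors held fixed.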
Checking model adequacy using a half-normal plot (hnp) and a plot of deviance residuals against predicted values
set.seed(42)
hnp(model,main="Half-normal plot with simulated envelope")
Binomial model
dev_res=resid(model,type="deviance")
pred= fitted(model)
plot(pred,dev_res,ylim=c(-3,3),xlab="predicted values",ylab="deviance residuals",main= "Deviance residual vs Predicted value plot")
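* Clearly there is a pattern here and the residuals are not randomly spread out.
* We can conclude that the model is not a very good fit.
* Keep in mind that with a 0/1 response the deviance residuals always split into two bands (one per class), so this plot is best read together with the half-normal plot above.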
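# Note: predict() without type="response" returns log-odds (the link scale),
# so the 0.5 cutoff below corresponds to a probability cutoff of about 0.62
test_pred=predict(model,newdata = test_data[,-19])
test_pred_bin=ifelse(test_pred>0.5,1,0)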
table(test_pred_bin)
test_pred_bin
0 1
451 63
set.seed(42)
confusionMatrix(as.factor(test_pred_bin),test_data$Target)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 223 228
1 21 42
Accuracy : 0.5156
95% CI : (0.4714, 0.5595)
No Information Rate : 0.5253
P-Value [Acc > NIR] : 0.6866
Kappa : 0.0668
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9139
Specificity : 0.1556
Pos Pred Value : 0.4945
Neg Pred Value : 0.6667
Prevalence : 0.4747
Detection Rate : 0.4339
Detection Prevalence : 0.8774
Balanced Accuracy : 0.5347
'Positive' Class : 0
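Two points help in reading this output. First, the predictions were thresholded on the log-odds scale, since `predict()` was called without `type = "response"`. Second, caret treats class 0 as the positive class, so "Sensitivity" refers to low-risk customers. The sketch below shows the probability cutoff implied by a log-odds cutoff of 0.5 and recomputes the headline metrics from the counts in the table above; the commented lines are an assumed probability-scale variant, not the code that produced these results.

```r
# predict() on a glm defaults to type = "link", i.e. log-odds.
# A cutoff of 0.5 on that scale corresponds to this probability cutoff:
plogis(0.5)

# Thresholding on the probability scale would instead look like this
# (commented out because it needs the fitted model and test_data from above):
# test_prob     <- predict(model, newdata = test_data, type = "response")
# test_pred_bin <- ifelse(test_prob > 0.5, 1, 0)

# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(223, 21, 228, 42), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # share of correct predictions
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # 'positive' class is 0, as in caret's output
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of actual 1s predicted as 1

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

In other words, with this cutoff the model flags only 42 of the 270 actual high-risk customers in the test set.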
test_pred2=predict(model,newdata = test_data[,-19],type="response")
test_pred3=prediction(test_pred2,test_data$Target)
perf=performance(test_pred3,"tpr","fpr")
plot(perf,colorize=T,main="ROC Curve")
curve(1*x,add=T,col="blue")
* AUC ROC score
auc=performance(test_pred3,"auc")@y.values[[1]]
auc
## [1] 0.5618549
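The AUC can be read as the probability that a randomly chosen high-risk customer receives a higher score than a randomly chosen low-risk one, so a value of about 0.56 is only slightly above the 0.5 expected from random ranking. The sketch below illustrates that pairwise definition in base R on made-up scores and labels; it does not use the model's actual predictions.

```r
# Made-up scores and labels, only to illustrate the computation
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score; ties count as half
pairs   <- outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n))
auc_toy <- mean(pairs)
auc_toy
```

Here 13 of the 16 positive-negative pairs are ordered correctly, which gives an AUC of 0.8125 for the toy data.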
The project aimed to predict the risk of credit card borrower default using logistic regression. The dataset included customer demographic, financial, and employment-related features. The data was preprocessed to handle categorical and numerical variables appropriately, and an exploratory data analysis (EDA) was conducted to understand the distribution and correlations among the features. The logistic regression model was trained and evaluated to predict whether a customer is a high or low credit risk.
Key metrics such as accuracy, sensitivity, specificity, and ROC-AUC were calculated to assess model performance. Additionally, model residuals were analyzed to check the adequacy of the logistic regression fit.
The logistic regression model performed weakly, with a test accuracy of approximately 52%, slightly below the no-information rate of about 53%. The ROC-AUC score of 0.56 suggests that the model’s predictive power was only slightly better than random guessing. Furthermore, the deviance residual analysis indicated that the model was not an ideal fit, as patterns were observed in the residuals, highlighting the need for a more complex model or additional feature engineering.
Despite the limitations, the model identified a few predictors,
namely Account_length, Own_property, and Age, whose coefficients
were statistically significant at the 5% level.
The analysis revealed that logistic regression, while easy to
interpret, struggled to capture the complexity of relationships
between features, even after the classes were balanced by random
undersampling. Only a handful of coefficients were significant at the
5% level (Email and Years_employed were not among them), and the
patterned residuals and low specificity (about 16%) suggested that the
model did not capture the full complexity of the problem.
For future work, alternative models such as random forests or gradient boosting could be explored to improve predictive accuracy. Additionally, handling class imbalance with techniques like SMOTE (Synthetic Minority Over-sampling Technique), rather than discarding 7,143 of the 8,426 low-risk records through random undersampling, may lead to better model performance.
The project provided initial insights into the features associated with credit risk and set the foundation for further refinement in predictive modeling.