Gathered from the UC Irvine Machine Learning Repository, this data contains randomly collected from an Iranian telecom company’s database over a period of 12 months. A total of 3150 rows of data, each representing a customer, bear information for 13 columns. The attributes that are in this dataset are call failures, frequency of SMS, number of complaints, number of distinct calls, subscription length, age group, the charge amount, type of service, seconds of use, status, frequency of use, and Customer Value.
All of the attributes except for attribute churn is the aggregated data of the first 9 months. The churn labels are the state of the customers at the end of 12 months. The three months is the designated planning gap.
Task: Predict whether a credit card customer will default.
Data values:
2: No consumption
1: Payment made on time
0: Use of revolving credit
1: No payment made for the past one month
2: No payment made for the past two months
9: No payment made for the past nine months
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
## Warning: package 'rattle' was built under R version 4.3.2
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## Credit.Limit PS.Curr PS.1M.ago PS.2M.ago
## Min. : 10000 Min. :-2.00000 Min. :-2.000 Min. :-2.0000
## 1st Qu.: 50000 1st Qu.:-1.00000 1st Qu.:-1.000 1st Qu.:-1.0000
## Median : 140000 Median : 0.00000 Median : 0.000 Median : 0.0000
## Mean : 167309 Mean :-0.01613 Mean :-0.124 Mean :-0.1611
## 3rd Qu.: 240000 3rd Qu.: 0.00000 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :1000000 Max. : 8.00000 Max. : 7.000 Max. : 8.0000
## Default
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2212
## 3rd Qu.:0.0000
## Max. :1.0000
## Credit.Limit PS.Curr PS.1M.ago PS.2M.ago
## Min. : 10000 Min. :-2.00000 Min. :-2.000 Min. :-2.0000
## 1st Qu.: 50000 1st Qu.:-1.00000 1st Qu.:-1.000 1st Qu.:-1.0000
## Median :140000 Median : 0.00000 Median : 0.000 Median : 0.0000
## Mean :167543 Mean :-0.01689 Mean :-0.137 Mean :-0.1679
## 3rd Qu.:240000 3rd Qu.: 0.00000 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :800000 Max. : 8.00000 Max. : 8.000 Max. : 8.0000
## Default
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2212
## 3rd Qu.:0.0000
## Max. :1.0000
# set Default as factor
def.tr$Default<- as.factor(def.tr$Default)
def.te$Default<- as.factor(def.te$Default)
# Non-normalized histogram
ggplot(def.tr, aes(PS.Curr))+
geom_histogram(aes(fill = Default),
color = "black", binwidth= 1) +
scale_fill_manual(name="Default Status",
values=c("#1f78b480","#33a02c80"),
labels=c("No","Yes"))+
xlab("Payment Status") +
ylab("Customer Frequency") +
ggtitle("Histogram of CPS and Default Status")From the non-normalized histogram we see the majority of non-defaulters use revolving credit to make their payments (X = 0). The second largest portion of non-defaulters pay their bill on time (X = -1).
# Normalized histogram
ggplot(def.tr, aes(PS.Curr)) +
geom_histogram(aes(fill = Default),
color = "black", binwidth= 1, position = "fill") +
scale_fill_manual(name="Default Status",
values=c("#1f78b480","#33a02c80"),
labels=c("No","Yes")) +
xlab("Payment Status") +
ylab("Customer Frequency") +
ggtitle("Normalized Histogram of CPS and Default Status")The normalized histogram illustrate that higher rates of defaulters are positively associated with an increase in the number of months a customer has gone without making a payment.
# Comparison box plot of customer payment status as of one month ago, by Default.
boxplot(PS.1M.ago ~ as.factor(Default),
col = "#1f78b480", data = def.tr,
main="Boxplot of CPS (1M) by Default Status",
xlab="Default Status",
ylab="CPS as of one month ago") The median value (Q2) of customer payment status is nearly 0 for both customers who do and do not default. The lower quartile (Q1) of both boxplots sit near a value of 1, indicating there are a significant number of customers in both groups that pay their bill on time. However, the upper quartile (Q3) of the boxplot for defaulters is higher and sits around 2, suggesting default status is associated more with overdue payments. The maximum value of defaulters sits around 6, creating a wider range of values for defaulters.
Although there are many outlier points representing payment status for non-defaulters, the high maximum value of payment status for defaulters suggests there are more customers who do not pay their bill on time within that group.
# Kruskal-Wallis test associated with the above box plot
kruskal.test(PS.1M.ago ~ as.factor(Default), data = def.tr)##
## Kruskal-Wallis rank sum test
##
## data: PS.1M.ago by as.factor(Default)
## Kruskal-Wallis chi-squared = 1006.3, df = 1, p-value < 2.2e-16
To test whether a defaulters have significantly higher CPS values compared to non-defaulters, a Kruskal-Wallis test was conducted. The null hypothesis assumes PS.1M.ago scores among defaulters and non-defaulters are the same. The alternative hypothesis assumes PS.1M.ago scores are statistically different.
Results indicate chi-squared = 1006.3 with an extremely significant p-value of 0.001***. The null hypothesis is rejected and the alternative hypothesis is supported. PS.1M.ago scores are significantly higher for defaulters than for non-defaulters, suggesting a relationship between overdue payments and defaulting.
## Credit.Limit PS.Curr PS.1M.ago PS.2M.ago
## Credit.Limit 1.0000000 -0.2715057 -0.2979874 -0.2861997
## PS.Curr -0.2715057 1.0000000 0.6704944 0.5735308
## PS.1M.ago -0.2979874 0.6704944 1.0000000 0.7676715
## PS.2M.ago -0.2861997 0.5735308 0.7676715 1.0000000
# correlation plot
pairs.panels(def.tr[, c(1:4)],
hist.col="#33a02c80",
show.points=TRUE,
stars=TRUE,
gap=0.05,
ellipses=FALSE,
scale=FALSE,
jiggle=TRUE,
factor=2,
pch=".",
main="Correlation Plot",
col="blue",
font=2)All predictors are negatively correlated with credit limit scores suggesting as credit limit scores rise, payment status scores fall, or the less overdue payments there are.
The largest correlation is between payment status as of one month and payment status as of two months ago at a value of 0.77. This might suggest the more months a customer goes without paying, the more likely they will not pay the following month.
Current payment status, payment status of one month ago and payment status of two months ago are all positively correlated. This suggests a strong relationship between customers with overdue payments in the current month and overdue payments in previous months.
# standardize training set, def.tr.z.
set.seed(123)
preprocess_z <-
preProcess(def.tr,
method=c("center", "scale"))
def.tr.z <-
predict(preprocess_z, def.tr)
describe(def.tr.z)## vars n mean sd median trimmed mad min max range skew
## Credit.Limit 1 22500 0.00 1.00 -0.21 -0.12 1.03 -1.22 4.89 6.11 0.98
## PS.Curr 2 22500 0.00 1.00 0.02 -0.04 1.32 -1.77 7.15 8.92 0.70
## PS.1M.ago 3 22500 0.00 1.00 0.11 -0.05 0.00 -1.56 6.80 8.36 0.78
## PS.2M.ago 4 22500 0.00 1.00 0.14 -0.06 0.00 -1.53 6.84 8.37 0.82
## Default* 5 22500 1.22 0.42 1.00 1.15 0.00 1.00 2.00 1.00 1.34
## kurtosis se
## Credit.Limit 0.46 0.01
## PS.Curr 2.53 0.01
## PS.1M.ago 1.50 0.01
## PS.2M.ago 1.94 0.01
## Default* -0.20 0.00
# standardize test set, def.te.z
preprocess_ze <- preProcess(def.te,
method=c("center", "scale"))
def.te.z <-
predict(preprocess_ze, def.te)
describe(def.te.z)## vars n mean sd median trimmed mad min max range skew
## Credit.Limit 1 7500 0.00 1.00 -0.21 -0.12 1.02 -1.20 6.36 7.56 1.03
## PS.Curr 2 7500 0.00 1.00 0.01 -0.04 1.31 -1.75 7.07 8.82 0.81
## PS.1M.ago 3 7500 0.00 1.00 0.10 -0.05 0.00 -1.56 5.93 7.49 0.83
## PS.2M.ago 4 7500 0.00 1.00 0.13 -0.06 0.00 -1.53 6.77 8.30 0.90
## Default* 5 7500 1.22 0.42 1.00 1.15 0.00 1.00 2.00 1.00 1.34
## kurtosis se
## Credit.Limit 0.75 0.01
## PS.Curr 3.25 0.01
## PS.1M.ago 1.78 0.01
## PS.2M.ago 2.50 0.01
## Default* -0.20 0.00
The test and training sets were standardized. Summary statistics show successful standardization with standard deviations of 1 and means of 0 for each predictor variable.
##
## 0 1
## 17523 4977
Using the distribution of Default, we can create an all-positive and all-negative model to establish baseline model performance. Customers with Default = 0 represent the all-negative model and Default = 1 represent the all positive model.
All Positive Model
| Predicted Category | |||
|---|---|---|---|
| Actual Category | Default = 0 | Default = 1 | Total |
| Default = 0 | 0 | 17,523 | 17,523 |
| Default = 1 | 0 | 4,977 | 4,977 |
| Total | 0 | 22,500 | 22,500 |
All Negative Model
| Predicted Category | |||
|---|---|---|---|
| Actual Category | Default = 0 | Default = 1 | Total |
| Default = 0 | 17,523 | 0 | 17,523 |
| Default = 1 | 4,977 | 0 | 4,977 |
| Total | 22,500 | 0 | 22,500 |
\(Accuracy = TN + TP/GT= (17523+0) / 22500= 77.88%\)
The all-negative model (accuracy = 77.88%) will be used as the baseline model given the high accuracy compared to the all-positive model (accuracy = 22.12%). Our model will have to beat the all-negative model.
# Note:
# default (1) as positive = 4977
# no default (0) as negative = 17523
# precision: 0 /17523 = 0
# accuracy: (0 + 17523)/ 22,500 = 0.7788
# sensitivity = 0 / 0 = 0
# specificity = 4,977 /22,500 = 0.2212
# precision = 0 / (0 +4,977) = 0
# recall = 0 / 0 + 4,977 = = 0
# Fbeta = precision * recall / (beta^2 * precision) + recall
# 1 + beta^2 * (0.2840 * 1) / ((0.2840) + 1) = 0.2211838
# 2 * 0.2211838
# F1 = 0.4423676
# 1 + beta * (0.2840 *1)/ (((2*2) * 0.2840) + 1)
# (1 + 2^2) * 0.1329588
# F2 = 0.664794
# 1 + beta * (0.2840 *1)/ (((0.5*0.5) * 0.2840) + 1)
# (1 + 0.5^2) * 0.2651727
# F0.5 = 0.3314659A Classification and Regression Tree model was employed to predict Default.
set.seed(1000)
TC <- trainControl(
method = "CV",
number = 10)
fit <- train(Default ~ ., data = def.tr.z,
method = "rpart2", #CART
trControl = TC)
fancyRpartPlot(fit$finalModel, main = "CART Model: Predicting Default Status")The above decision tree predicts default status based on variables Credit.Limit, PS.Curr, PS.1m.ago, and PS.2m.ago using a classification and regression-based algorithm. In this chart, blue leaves indicate defaulting customers (22.12% of records in the training data set) and green leaves indicate low-income earners (87.88% of records). The split is made for whether the value of a customer’s current payment status is less than 1.4, or if a payment was made longer than one month ago. Decision rules are as follows:
If current payment status < 1.4, then a customer’s default status = 0, with 90% support and confidence of 83%.
If current payment status > 1.4, then a customer’s default status = 1, with 10% support and confidence of 69%.
## Accuracy Kappa Resample
## 1 0.8111950 0.3207329 Fold01
## 2 0.8173333 0.3395313 Fold02
## 3 0.8182222 0.3554454 Fold05
## 4 0.8176968 0.3437773 Fold04
## 5 0.8271111 0.3869639 Fold03
## 6 0.8066667 0.3031913 Fold06
## 7 0.8146667 0.3201471 Fold09
## 8 0.8168889 0.3555922 Fold08
## 9 0.8297778 0.3889052 Fold07
## 10 0.8280000 0.3922510 Fold10
# predicted values
testsetpreds <- predict(fit, def.te.z)
# table of predictions against test values
table(def.te.z$Default, testsetpreds)## testsetpreds
## 0 1
## 0 5609 232
## 1 1102 557
## 0 1
## 5841 1659
##
## 0 1
## 5841 1659
# accuracy: (5609+557) / 7500= 0.8221333
# sensitivity = 557 / 1659 = 0.3357444
# specificity = 5609 / 5841 = 0.9603
# precision = 557/ (232 + 557) = 0.7060
# recall = sensitivity
# Fbeta:
# F1
# (1 + (1^2)) * ((0.7060 * 0.33574) / (((1*1) * 0.7060) + 0.3357))
# 2 * 0.2275439
# = 0.4550878
# F2
# (1 + (2^2)) * ((0.7060 * 0.33574) / (((2^2) * 0.7060) + 0.3357))
# 0.3750869
# F0.5
# (1 + (0.5^2)) * ((0.7060 * 0.33574) / (((0.5*0.5) * 0.7060) + 0.3357))
# 0.5784665A CART model was developed for classifying Default, using 10-fold cross-validation. The model was assessed for overfitting by analyzing accuracy statistics of the k-folds used to partition the data through cross-validation. Fold 9 (82.99%) has the highest level of accuracy and Fold 6 has the lowest level of accuracy (80.67%). All folds are within a 2.31% difference. We judge that this does not reflect significant evidence of overfitting with an allowance of 5% variation among the ten folds.
A Neural Networks model was also built to predict Default.
set.seed(123)
TC2 <- trainControl(
method = "CV",
number = 10)
fit2 <- train(Default ~ ., data = def.tr.z, method = "mlp", trControl = TC2)
fit2## Multi-Layer Perceptron
##
## 22500 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 20250, 20251, 20250, 20251, 20250, 20250, ...
## Resampling results across tuning parameters:
##
## size Accuracy Kappa
## 1 0.8174220 0.3683108
## 3 0.8184883 0.3602527
## 5 0.8205774 0.3567152
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was size = 5.
## Accuracy Kappa Resample
## 1 0.8155556 0.3340375 Fold01
## 2 0.8176968 0.3760970 Fold02
## 3 0.8253333 0.3762776 Fold05
## 4 0.8248110 0.3590822 Fold04
## 5 0.8226667 0.3656329 Fold03
## 6 0.8182222 0.3508843 Fold06
## 7 0.8115556 0.3118611 Fold09
## 8 0.8240782 0.3642504 Fold08
## 9 0.8200000 0.3572326 Fold07
## 10 0.8258552 0.3717967 Fold10
# predicted values
testsetpreds2 <- predict(fit2, def.te.z)
# table of predictions against test values
table(def.te.z$Default, testsetpreds2)## testsetpreds2
## 0 1
## 0 5565 276
## 1 1071 588
##
## 0 1
## 5841 1659
# 0 1
# 0 5565 276
# 1 1071 588
# accuracy: (5565 + 588)/ 7500 = 0.8204
# sensitivity = 588 / 1659 = 0.3544304
# specificity = 5565/5841 = 0.9527478
# precision = 588 / (276 +588) = 0.6806
# recall = true pos / total actually positives = 629/(629+326) = 0.3544
# Fbeta
# F1
# (1 + (1^2)) * ((0.6806 * 0.3544) / (((1*1) * 0.6806) + 0.3544))
# 0.4660959
# F2
# (1 + (2^2)) * ((0.6806 * 0.3544) / (((2*2) * 0.6806) + 0.3544))
# 0.3919732
# F0.5
# (1 + (0.5^2)) * ((0.6806 * 0.3544) / (((0.5^2) * 0.6806) + 0.3544))
# 0.5747894A neural networks model was developed for classifying Default, using 10-fold cross-validation. The model was checked for overfitting by analyzing accuracy statistics of the k-folds used to partition the data through cross-validation. Fold 3 (83.15%) has the highest level of accuracy and Fold 2 has the lowest level of accuracy (80.84%). All folds are within a 2.31% difference. We judge that this does not reflect significant evidence of overfitting with an allowance of 5% variation among the ten folds.
TN/FP/FN/TP
| All P o s i tive M odel | All N e g a tive M odel | CART M odel | NN Model | |
|---|---|---|---|---|
| TN | 0 | 17 ,523 | 5 ,609 | 5,565 |
| FP | 1 7 ,523 | 0 | 1 ,102 | 276 |
| FN | 0 | 4 ,977 | 232 | 1,071 |
| TP | 4 ,977 | 0 | 557 | 588 |
All-negative and all-positive models were constructed to establish baseline performance. Non-defaulters represent the all-negative model and defaulters represent the all-positive model. The all-negative model (accuracy = 75.92%) was used as the baseline model given the high accuracy compared to the all-positive model (accuracy = 24.08%) which has poor accuracy and a high error rate. Metrics from the CART and neural networks model are compared to the baseline model.
Accuracy/Error Rate
| All P o s i tive M odel | All N e g a t ive M o del | C ART M o del | NN M o del | |
|---|---|---|---|---|
| A c c u r a c y | 0 . 2840 | 0 . 7 592 | 0 . 821 3 | 0 . 8 204 |
|
0 . 7260 | 0 . 2 408 | 0 . 178 7 | 0 . 1 796 |
The all-negative model is the baseline model with nothing but negative predictions and an overall accuracy of 75.92%. The CART and neural network (NN) model performed similarly, but the CART model beat the neural networks model in accuracy by about 0.09%. The CART model reduced the error rate by 6.21% from 24.08% in the baseline model, to 17.87%, making it overall the best model.
Sensitivity/Specificity
| All P o s i t ive M o del | All N e g a t ive M o del | CART M odel | NN Model | |
|---|---|---|---|---|
| Sens itivity | 1.0 | 0.0 | 0 . 3357 | 0 . 3544 |
| Spec ificity | 0.0 | 1.0 | 0 . 9603 | 0 . 9528 |
The CART and NN models performed poorly with respect to sensitivity. Of the 1659 actually positive responses, the CART model uncovered only 557 of them, the NN model uncovered 588, for a sensitivity of only 33.57% and 33.54% respectively.
The CART and NN models performed much better with respect to specificity. The CART model has a specificity of 96.03%, correctly identifying 5609 of the 5841 non-defaulting customers. The NN model has a specificity of 95.28%, identifying 5565 of the 5841 non-defaulting customers.
Precision/Recall
| All Positive Model | All Negative Model | CART Model | NN Model | |
|---|---|---|---|---|
| Pre cision | 0.2840 | NA | 0.7060 | 0.6586 |
| Recall | 1.0 | 0.0 | 0.3357 | 0.3544 |
The all-positive model has a precision value of 28.40%, which is the proportion of records with default status = 1. The all-negative model has a precision value that is undefined as no records were identified as positive. The CART model performed the best in terms of precision (70.60%). Of the 789 total positive predictions it made, 557 were positive. The NN model performed slightly poorer with a precision value of 65.86%, correctly making 629 of the 955 positive predictions.
Recall represents the proportion of correctly identified positive predictions among all positive predictions. Outside of the baseline model, the NN model scored highest, outperforming the CART model by 1.87%. Precision and recall are undefined for the all-negative model as it does not predict any records positively.
𝐹𝛽Scores
| All Positive Model | All Negative Model | CART Model | NN Model | |
|---|---|---|---|---|
| \(F_1\) | 0.4424 | NA | 0.4551 | 0.4661 |
$F_0.5$ |
0.3315 | NA | 0.5785 | 0.5748 |
| \(F_2\) | 0.6648 | NA | 0.3751 | 0.3920 |
F𝛽 scores were constructed as the final model evaluation metric, which represents a weighted combination of precision and recall measures. When recall and precision are equally weighted (𝛽=1), the NN model performs the greatest of all models. The NN model had acceptable levels of precision but poor recall and a 𝐹1 score of 0.4661, just a 0.11 increase from the CART model of 0.4551.
The CART model performed the poorest at 𝐹2 (0.3751), because its sensitivity (recall) was so low (33.57%) and 𝐹2 weights sensitivity higher than precision. The NN model is slightly better at 𝐹2 = 0.3920. The all-positive model scored the highest for 𝐹2 given its 100% sensitivity. When recall is weighted lower than precision (𝛽=0.5), the CART model has the highest performance (0.5785) and is the highest score among all models given its good precision (70.60%). NN model is just slightly less powerful with a score of 0.5748 with a 0.037 difference. Precision and recall are undefined for the all-negative model therefore 𝐹𝛽 scores are not applicable.
Overall, the CART and NN model perform comparably, however the CART model has greater precision and accuracy and therefore will produce more powerful results.
Works Cited
Iranian Churn Dataset. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5JW3Z.