Probem Understanding

Gathered from the UC Irvine Machine Learning Repository, this data contains randomly collected from an Iranian telecom company’s database over a period of 12 months. A total of 3150 rows of data, each representing a customer, bear information for 13 columns. The attributes that are in this dataset are call failures, frequency of SMS, number of complaints, number of distinct calls, subscription length, age group, the charge amount, type of service, seconds of use, status, frequency of use, and Customer Value.

All of the attributes except for attribute churn is the aggregated data of the first 9 months. The churn labels are the state of the customers at the end of 12 months. The three months is the designated planning gap.

Task: Predict whether a credit card customer will default.

Data values:

2: No consumption
1: Payment made on time
0: Use of revolving credit
1: No payment made for the past one month
2: No payment made for the past two months
9: No payment made for the past nine months

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## Warning: package 'caret' was built under R version 4.3.2

## Loading required package: lattice

## Warning: package 'rattle' was built under R version 4.3.2

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

# import testing and training sets
def.te<- read.csv("def_test")
def.tr<- read.csv("def_train")

# nullify index var
def.te$X<- NULL
def.tr$X<- NULL

Data Understanding

# summary statistics
summary(def.te)

##   Credit.Limit        PS.Curr           PS.1M.ago        PS.2M.ago      
##  Min.   :  10000   Min.   :-2.00000   Min.   :-2.000   Min.   :-2.0000  
##  1st Qu.:  50000   1st Qu.:-1.00000   1st Qu.:-1.000   1st Qu.:-1.0000  
##  Median : 140000   Median : 0.00000   Median : 0.000   Median : 0.0000  
##  Mean   : 167309   Mean   :-0.01613   Mean   :-0.124   Mean   :-0.1611  
##  3rd Qu.: 240000   3rd Qu.: 0.00000   3rd Qu.: 0.000   3rd Qu.: 0.0000  
##  Max.   :1000000   Max.   : 8.00000   Max.   : 7.000   Max.   : 8.0000  
##     Default      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2212  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

summary(def.tr)

##   Credit.Limit       PS.Curr           PS.1M.ago        PS.2M.ago      
##  Min.   : 10000   Min.   :-2.00000   Min.   :-2.000   Min.   :-2.0000  
##  1st Qu.: 50000   1st Qu.:-1.00000   1st Qu.:-1.000   1st Qu.:-1.0000  
##  Median :140000   Median : 0.00000   Median : 0.000   Median : 0.0000  
##  Mean   :167543   Mean   :-0.01689   Mean   :-0.137   Mean   :-0.1679  
##  3rd Qu.:240000   3rd Qu.: 0.00000   3rd Qu.: 0.000   3rd Qu.: 0.0000  
##  Max.   :800000   Max.   : 8.00000   Max.   : 8.000   Max.   : 8.0000  
##     Default      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2212  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

# set Default as factor
def.tr$Default<- as.factor(def.tr$Default)
def.te$Default<- as.factor(def.te$Default)


# Non-normalized histogram

 ggplot(def.tr, aes(PS.Curr))+
   geom_histogram(aes(fill = Default),
        color = "black", binwidth= 1) + 
   scale_fill_manual(name="Default Status",
                     values=c("#1f78b480","#33a02c80"), 
                     labels=c("No","Yes"))+
   xlab("Payment Status") + 
   ylab("Customer Frequency") + 
   ggtitle("Histogram of CPS and Default Status")

From the non-normalized histogram we see the majority of non-defaulters use revolving credit to make their payments (X = 0). The second largest portion of non-defaulters pay their bill on time (X = -1).

# Normalized histogram

ggplot(def.tr, aes(PS.Curr)) +
  geom_histogram(aes(fill = Default),
  color = "black", binwidth= 1, position = "fill") +
  scale_fill_manual(name="Default Status", 
                    values=c("#1f78b480","#33a02c80"), 
                    labels=c("No","Yes")) +
  xlab("Payment Status") +
  ylab("Customer Frequency") + 
  ggtitle("Normalized Histogram of CPS and Default Status")

The normalized histogram illustrate that higher rates of defaulters are positively associated with an increase in the number of months a customer has gone without making a payment.

# Comparison box plot of customer payment status as of one month ago, by Default.  

boxplot(PS.1M.ago ~ as.factor(Default), 
        col = "#1f78b480", data = def.tr, 
        main="Boxplot of CPS (1M) by Default Status",
        xlab="Default Status",
        ylab="CPS as of one month ago")

The median value (Q2) of customer payment status is nearly 0 for both customers who do and do not default. The lower quartile (Q1) of both boxplots sit near a value of 1, indicating there are a significant number of customers in both groups that pay their bill on time. However, the upper quartile (Q3) of the boxplot for defaulters is higher and sits around 2, suggesting default status is associated more with overdue payments. The maximum value of defaulters sits around 6, creating a wider range of values for defaulters.

Although there are many outlier points representing payment status for non-defaulters, the high maximum value of payment status for defaulters suggests there are more customers who do not pay their bill on time within that group.

# Kruskal-Wallis test associated with the above box plot
kruskal.test(PS.1M.ago ~ as.factor(Default), data = def.tr)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  PS.1M.ago by as.factor(Default)
## Kruskal-Wallis chi-squared = 1006.3, df = 1, p-value < 2.2e-16

To test whether a defaulters have significantly higher CPS values compared to non-defaulters, a Kruskal-Wallis test was conducted. The null hypothesis assumes PS.1M.ago scores among defaulters and non-defaulters are the same. The alternative hypothesis assumes PS.1M.ago scores are statistically different.

Results indicate chi-squared = 1006.3 with an extremely significant p-value of 0.001***. The null hypothesis is rejected and the alternative hypothesis is supported. PS.1M.ago scores are significantly higher for defaulters than for non-defaulters, suggesting a relationship between overdue payments and defaulting.

# predictor correlations
set.seed(123)
cor(def.tr[, c(1:4)])

##              Credit.Limit    PS.Curr  PS.1M.ago  PS.2M.ago
## Credit.Limit    1.0000000 -0.2715057 -0.2979874 -0.2861997
## PS.Curr        -0.2715057  1.0000000  0.6704944  0.5735308
## PS.1M.ago      -0.2979874  0.6704944  1.0000000  0.7676715
## PS.2M.ago      -0.2861997  0.5735308  0.7676715  1.0000000

# correlation plot 
pairs.panels(def.tr[, c(1:4)], 
             hist.col="#33a02c80", 
             show.points=TRUE, 
             stars=TRUE, 
             gap=0.05, 
             ellipses=FALSE, 
             scale=FALSE,
             jiggle=TRUE,
             factor=2,
             pch=".", 
             main="Correlation Plot", 
             col="blue", 
             font=2)

All predictors are negatively correlated with credit limit scores suggesting as credit limit scores rise, payment status scores fall, or the less overdue payments there are.

The largest correlation is between payment status as of one month and payment status as of two months ago at a value of 0.77. This might suggest the more months a customer goes without paying, the more likely they will not pay the following month.

Current payment status, payment status of one month ago and payment status of two months ago are all positively correlated. This suggests a strong relationship between customers with overdue payments in the current month and overdue payments in previous months.

Data Preparation

# standardize training set, def.tr.z.  

set.seed(123)
preprocess_z <-
  preProcess(def.tr,
             method=c("center", "scale"))

def.tr.z <-
  predict(preprocess_z, def.tr)
describe(def.tr.z)

##              vars     n mean   sd median trimmed  mad   min  max range skew
## Credit.Limit    1 22500 0.00 1.00  -0.21   -0.12 1.03 -1.22 4.89  6.11 0.98
## PS.Curr         2 22500 0.00 1.00   0.02   -0.04 1.32 -1.77 7.15  8.92 0.70
## PS.1M.ago       3 22500 0.00 1.00   0.11   -0.05 0.00 -1.56 6.80  8.36 0.78
## PS.2M.ago       4 22500 0.00 1.00   0.14   -0.06 0.00 -1.53 6.84  8.37 0.82
## Default*        5 22500 1.22 0.42   1.00    1.15 0.00  1.00 2.00  1.00 1.34
##              kurtosis   se
## Credit.Limit     0.46 0.01
## PS.Curr          2.53 0.01
## PS.1M.ago        1.50 0.01
## PS.2M.ago        1.94 0.01
## Default*        -0.20 0.00

# standardize test set, def.te.z
 preprocess_ze <- preProcess(def.te,
             method=c("center", "scale"))

def.te.z <-
  predict(preprocess_ze, def.te)
describe(def.te.z)

##              vars    n mean   sd median trimmed  mad   min  max range skew
## Credit.Limit    1 7500 0.00 1.00  -0.21   -0.12 1.02 -1.20 6.36  7.56 1.03
## PS.Curr         2 7500 0.00 1.00   0.01   -0.04 1.31 -1.75 7.07  8.82 0.81
## PS.1M.ago       3 7500 0.00 1.00   0.10   -0.05 0.00 -1.56 5.93  7.49 0.83
## PS.2M.ago       4 7500 0.00 1.00   0.13   -0.06 0.00 -1.53 6.77  8.30 0.90
## Default*        5 7500 1.22 0.42   1.00    1.15 0.00  1.00 2.00  1.00 1.34
##              kurtosis   se
## Credit.Limit     0.75 0.01
## PS.Curr          3.25 0.01
## PS.1M.ago        1.78 0.01
## PS.2M.ago        2.50 0.01
## Default*        -0.20 0.00

# save test and training set
write.csv(def.te.z, "def.te.z")
write.csv(def.tr.z, "def.tr.z")

The test and training sets were standardized. Summary statistics show successful standardization with standard deviations of 1 and means of 0 for each predictor variable.

Model Building

All Positive Model

# distribution of Default
table(def.tr.z$Default)

## 
##     0     1 
## 17523  4977

Using the distribution of Default, we can create an all-positive and all-negative model to establish baseline model performance. Customers with Default = 0 represent the all-negative model and Default = 1 represent the all positive model.

All Positive Model

	Predicted Category
Actual Category	Default = 0	Default = 1	Total
Default = 0	0	17,523	17,523
Default = 1	0	4,977	4,977
Total	0	22,500	22,500

All Negative Model

All Negative Model

	Predicted Category
Actual Category	Default = 0	Default = 1	Total
Default = 0	17,523	0	17,523
Default = 1	4,977	0	4,977
Total	22,500	0	22,500

$Accuracy = TN + TP/GT= (17523+0) / 22500= 77.88%$

The all-negative model (accuracy = 77.88%) will be used as the baseline model given the high accuracy compared to the all-positive model (accuracy = 22.12%). Our model will have to beat the all-negative model.

Baseline Model Metrics

# Note: 
# default (1) as positive  = 4977
# no default (0) as negative  =  17523


# precision:  0 /17523  = 0
# accuracy: (0 + 17523)/ 22,500 =  0.7788
# sensitivity = 0 / 0  =  0
# specificity = 4,977 /22,500 = 0.2212
# precision =  0 / (0 +4,977) = 0
# recall = 0 / 0 + 4,977 =   =  0

# Fbeta = precision * recall / (beta^2 * precision) + recall

# 1 + beta^2 * (0.2840 * 1) / ((0.2840) + 1) =  0.2211838
# 2 * 0.2211838
# F1 =  0.4423676

# 1 + beta * (0.2840 *1)/ (((2*2) * 0.2840) + 1)
# (1 + 2^2) * 0.1329588
# F2 = 0.664794

# 1 + beta * (0.2840 *1)/ (((0.5*0.5) * 0.2840) + 1)
# (1 + 0.5^2) *  0.2651727
# F0.5 = 0.3314659

CART Model

A Classification and Regression Tree model was employed to predict Default.

set.seed(1000)

TC <- trainControl(
  method = "CV",
  number = 10)

fit <- train(Default ~ ., data = def.tr.z, 
             method = "rpart2", #CART
             trControl = TC)
fancyRpartPlot(fit$finalModel, main = "CART Model: Predicting Default Status")

The above decision tree predicts default status based on variables Credit.Limit, PS.Curr, PS.1m.ago, and PS.2m.ago using a classification and regression-based algorithm. In this chart, blue leaves indicate defaulting customers (22.12% of records in the training data set) and green leaves indicate low-income earners (87.88% of records). The split is made for whether the value of a customer’s current payment status is less than 1.4, or if a payment was made longer than one month ago. Decision rules are as follows:

If current payment status < 1.4, then a customer’s default status = 0, with 90% support and confidence of 83%.
If current payment status > 1.4, then a customer’s default status = 1, with 10% support and confidence of 69%.

CART Model Metrics

# check for overfitting
set.seed(1000)
fit$resample

##     Accuracy     Kappa Resample
## 1  0.8111950 0.3207329   Fold01
## 2  0.8173333 0.3395313   Fold02
## 3  0.8182222 0.3554454   Fold05
## 4  0.8176968 0.3437773   Fold04
## 5  0.8271111 0.3869639   Fold03
## 6  0.8066667 0.3031913   Fold06
## 7  0.8146667 0.3201471   Fold09
## 8  0.8168889 0.3555922   Fold08
## 9  0.8297778 0.3889052   Fold07
## 10 0.8280000 0.3922510   Fold10

# predicted values
testsetpreds <- predict(fit, def.te.z)

# table of predictions against test values
table(def.te.z$Default, testsetpreds)

##    testsetpreds
##        0    1
##   0 5609  232
##   1 1102  557

summary(def.te.z$Default)

##    0    1 
## 5841 1659

table(def.te.z$Default)

## 
##    0    1 
## 5841 1659

# accuracy:  (5609+557) / 7500=  0.8221333 
# sensitivity = 557 / 1659 = 0.3357444
# specificity = 5609 / 5841 = 0.9603
# precision = 557/ (232 + 557) = 0.7060
# recall = sensitivity

# Fbeta: 

# F1
# (1 + (1^2)) * ((0.7060 * 0.33574) / (((1*1) * 0.7060) + 0.3357))
# 2 * 0.2275439
#  = 0.4550878

# F2
# (1 + (2^2)) * ((0.7060 * 0.33574) / (((2^2) * 0.7060) + 0.3357))
# 0.3750869

# F0.5
# (1 + (0.5^2)) * ((0.7060 * 0.33574) / (((0.5*0.5) * 0.7060) + 0.3357))
# 0.5784665

A CART model was developed for classifying Default, using 10-fold cross-validation. The model was assessed for overfitting by analyzing accuracy statistics of the k-folds used to partition the data through cross-validation. Fold 9 (82.99%) has the highest level of accuracy and Fold 6 has the lowest level of accuracy (80.67%). All folds are within a 2.31% difference. We judge that this does not reflect significant evidence of overfitting with an allowance of 5% variation among the ten folds.

Neural Networks Model

A Neural Networks model was also built to predict Default.

set.seed(123)
TC2 <- trainControl(
  method = "CV",
  number = 10)

fit2 <- train(Default ~ ., data = def.tr.z, method = "mlp", trControl = TC2)
fit2

## Multi-Layer Perceptron 
## 
## 22500 samples
##     4 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 20250, 20251, 20250, 20251, 20250, 20250, ... 
## Resampling results across tuning parameters:
## 
##   size  Accuracy   Kappa    
##   1     0.8174220  0.3683108
##   3     0.8184883  0.3602527
##   5     0.8205774  0.3567152
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was size = 5.

# check model for overfitting
fit2$resample

##     Accuracy     Kappa Resample
## 1  0.8155556 0.3340375   Fold01
## 2  0.8176968 0.3760970   Fold02
## 3  0.8253333 0.3762776   Fold05
## 4  0.8248110 0.3590822   Fold04
## 5  0.8226667 0.3656329   Fold03
## 6  0.8182222 0.3508843   Fold06
## 7  0.8115556 0.3118611   Fold09
## 8  0.8240782 0.3642504   Fold08
## 9  0.8200000 0.3572326   Fold07
## 10 0.8258552 0.3717967   Fold10

# predicted values 
testsetpreds2 <- predict(fit2, def.te.z)

# table of predictions against test values
table(def.te.z$Default, testsetpreds2)

##    testsetpreds2
##        0    1
##   0 5565  276
##   1 1071  588

table (def.te.z$Default)

## 
##    0    1 
## 5841 1659

Neural Networks Model Metrics

#    0    1
# 0 5565  276
# 1 1071  588

# accuracy: (5565 + 588)/ 7500 =  0.8204
# sensitivity = 588 / 1659 =  0.3544304
# specificity = 5565/5841 = 0.9527478
# precision =  588 / (276 +588) = 0.6806
# recall = true pos / total actually positives = 629/(629+326)  =  0.3544

# Fbeta

# F1
# (1 + (1^2)) * ((0.6806 * 0.3544) / (((1*1) * 0.6806) + 0.3544))
# 0.4660959


# F2
# (1 + (2^2)) * ((0.6806 * 0.3544) / (((2*2) * 0.6806) + 0.3544))
# 0.3919732

# F0.5
# (1 + (0.5^2)) * ((0.6806 * 0.3544) / (((0.5^2) * 0.6806) + 0.3544))
# 0.5747894

A neural networks model was developed for classifying Default, using 10-fold cross-validation. The model was checked for overfitting by analyzing accuracy statistics of the k-folds used to partition the data through cross-validation. Fold 3 (83.15%) has the highest level of accuracy and Fold 2 has the lowest level of accuracy (80.84%). All folds are within a 2.31% difference. We judge that this does not reflect significant evidence of overfitting with an allowance of 5% variation among the ten folds.

Model Evaluation

TN/FP/FN/TP

	All P o s i tive M odel	All N e g a tive M odel	CART M odel	NN Model
TN	0	17 ,523	5 ,609	5,565
FP	1 7 ,523	0	1 ,102	276
FN	0	4 ,977	232	1,071
TP	4 ,977	0	557	588

All-negative and all-positive models were constructed to establish baseline performance. Non-defaulters represent the all-negative model and defaulters represent the all-positive model. The all-negative model (accuracy = 75.92%) was used as the baseline model given the high accuracy compared to the all-positive model (accuracy = 24.08%) which has poor accuracy and a high error rate. Metrics from the CART and neural networks model are compared to the baseline model.

Accuracy/Error Rate

	All P o s i tive M odel	All N e g a t ive M o del	C ART M o del	NN M o del
A c c u r a c y	0 . 2840	0 . 7 592	0 . 821 3	0 . 8 204
Er ror Rat e*	0 . 7260	0 . 2 408	0 . 178 7	0 . 1 796

The all-negative model is the baseline model with nothing but negative predictions and an overall accuracy of 75.92%. The CART and neural network (NN) model performed similarly, but the CART model beat the neural networks model in accuracy by about 0.09%. The CART model reduced the error rate by 6.21% from 24.08% in the baseline model, to 17.87%, making it overall the best model.

Sensitivity/Specificity

	All P o s i t ive M o del	All N e g a t ive M o del	CART M odel	NN Model
Sens itivity	1.0	0.0	0 . 3357	0 . 3544
Spec ificity	0.0	1.0	0 . 9603	0 . 9528

The CART and NN models performed poorly with respect to sensitivity. Of the 1659 actually positive responses, the CART model uncovered only 557 of them, the NN model uncovered 588, for a sensitivity of only 33.57% and 33.54% respectively.

The CART and NN models performed much better with respect to specificity. The CART model has a specificity of 96.03%, correctly identifying 5609 of the 5841 non-defaulting customers. The NN model has a specificity of 95.28%, identifying 5565 of the 5841 non-defaulting customers.

Precision/Recall

	All Positive Model	All Negative Model	CART Model	NN Model
Pre cision	0.2840	NA	0.7060	0.6586
Recall	1.0	0.0	0.3357	0.3544

The all-positive model has a precision value of 28.40%, which is the proportion of records with default status = 1. The all-negative model has a precision value that is undefined as no records were identified as positive. The CART model performed the best in terms of precision (70.60%). Of the 789 total positive predictions it made, 557 were positive. The NN model performed slightly poorer with a precision value of 65.86%, correctly making 629 of the 955 positive predictions.

Recall represents the proportion of correctly identified positive predictions among all positive predictions. Outside of the baseline model, the NN model scored highest, outperforming the CART model by 1.87%. Precision and recall are undefined for the all-negative model as it does not predict any records positively.

𝐹𝛽Scores

	All Positive Model	All Negative Model	CART Model	NN Model
$F_1$	0.4424	NA	0.4551	0.4661
$F_0.5$	0.3315	NA	0.5785	0.5748
$F_2$	0.6648	NA	0.3751	0.3920

F𝛽 scores were constructed as the final model evaluation metric, which represents a weighted combination of precision and recall measures. When recall and precision are equally weighted (𝛽=1), the NN model performs the greatest of all models. The NN model had acceptable levels of precision but poor recall and a 𝐹1 score of 0.4661, just a 0.11 increase from the CART model of 0.4551.

The CART model performed the poorest at 𝐹2 (0.3751), because its sensitivity (recall) was so low (33.57%) and 𝐹2 weights sensitivity higher than precision. The NN model is slightly better at 𝐹2 = 0.3920. The all-positive model scored the highest for 𝐹2 given its 100% sensitivity. When recall is weighted lower than precision (𝛽=0.5), the CART model has the highest performance (0.5785) and is the highest score among all models given its good precision (70.60%). NN model is just slightly less powerful with a score of 0.5748 with a 0.037 difference. Precision and recall are undefined for the all-negative model therefore 𝐹𝛽 scores are not applicable.

Overall, the CART and NN model perform comparably, however the CART model has greater precision and accuracy and therefore will produce more powerful results.

Works Cited

Iranian Churn Dataset. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5JW3Z.

Predicting Churning Customers

Acadia Berry

12/13/2023