Identifying the Key Drivers of Credit Card Debt and Segmenting Customers by Financial Behavior
Author
Favour Adekunle and Thanh Nguyen
Group Name:
Analytical Minds
Group Members:
Favour Adekunle
Thanh Nguyen
Introduction
Credit cards have become one of the most widely used financial tools for everyday transactions, online purchases, emergency spending, and short-term access to credit. They provide convenience and financial flexibility for customers, while also serving as an important revenue source for banks and financial institutions. However, improper credit card usage can lead to increasing debt balances, missed payments, and financial stress for customers.
Understanding the factors that contribute to credit card debt is important for both consumers and financial institutions. Customers can benefit from insights into spending and repayment behavior, while banks can use data-driven findings to improve customer risk assessment, product design, and financial support strategies.
This project focuses on analyzing customer credit card behavior using the Credit Card Customers Dataset (CC GENERAL) from Kaggle. The dataset contains information on customer balances, purchases, cash advances, credit limits, payments, and usage frequency. By examining these financial behaviors, the project aims to identify the key drivers of credit card debt and group customers into meaningful segments based on their financial patterns.
The project will apply data wrangling, visualization, regression modeling, clustering, and interactive reporting in R using methods covered in class. The results of this project can provide practical recommendations for improving credit management, reducing excessive debt, and understanding different customer financial profiles.
Project Objective
The main objective of this project is to determine which customer financial behaviors are most associated with higher credit card balances and to classify customers into groups based on their spending, borrowing, and repayment patterns.
Data Description
This project uses the Credit Card Customers Dataset (CC GENERAL) obtained from Kaggle. The dataset contains information on customer credit card usage behavior and financial activity. It includes variables related to balances, purchases, payments, cash advances, credit limits, and transaction frequency. The dataset is appropriate for this project because it provides customer-level financial data that can be used to analyze the main factors associated with credit card debt and spending behavior. Each row in the dataset represents one credit card customer and the dataset contains 8,950 observations (customers).
Key Variables
Variable | Description
--- | ---
CUST_ID | Unique customer identification number
BALANCE | Current outstanding balance on the credit card
PURCHASES | Total amount of purchases made
ONEOFF_PURCHASES | Amount spent on one-time purchases
INSTALLMENTS_PURCHASES | Amount spent through installment purchases
CASH_ADVANCE | Cash withdrawn using the credit card
CREDIT_LIMIT | Maximum available credit limit
PAYMENTS | Total payments made by the customer
MINIMUM_PAYMENTS | Minimum required payment
PURCHASES_FREQUENCY | Frequency of purchases
CASH_ADVANCE_FREQUENCY | Frequency of cash advance usage
PRC_FULL_PAYMENT | Percentage of full payments made
TENURE | Number of months as a customer
Target Variable
The target (dependent) variable for the regression analysis is BALANCE. This is used as a measure of customer credit card debt.
Variables for Clustering
To group customers into similar financial behavior segments, the clustering analysis will use selected independent variables such as:
PURCHASES
CASH_ADVANCE
PAYMENTS
CREDIT_LIMIT
PURCHASES_FREQUENCY
TENURE
(BALANCE will be excluded from clustering as required in the project checklist.)
Importing the Data
# install.packages("tidyverse")
library(tidyverse)

# Import the dataset
cc <- read.csv("CC GENERAL.csv")

# Inspect the dataset
str(cc)
The dataset contains 8,950 customer observations and 18 financial variables related to balances, purchases, payments, cash advances, and credit limits. Initial inspection shows that most variables are numeric and suitable for quantitative analysis. Missing values were detected in MINIMUM_PAYMENTS (313 missing values) and CREDIT_LIMIT (1 missing value), which require cleaning before regression and clustering.
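The missing-value counts above can be checked by counting NA entries per column. The sketch below is a minimal illustration on a small invented data frame (`toy` and its values are hypothetical, not the Kaggle file); on the imported data the same idea would be applied to `cc`.

```r
# Toy stand-in for the imported data; all values are invented for illustration
toy <- data.frame(
  BALANCE          = c(40.9, 3202.5, 2495.1, 1666.7),
  CREDIT_LIMIT     = c(1000, 7000, NA, 7500),
  MINIMUM_PAYMENTS = c(139.5, NA, 627.3, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Show only the columns that actually contain missing values
print(na_counts[na_counts > 0])
```

Columns with a count of zero need no cleaning; the others are candidates for row removal or imputation.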
CUST_ID BALANCE PURCHASES CASH_ADVANCE
Length:8636 Min. : 0.0 Min. : 0.00 Min. : 0.0
Class :character 1st Qu.: 148.1 1st Qu.: 43.37 1st Qu.: 0.0
Mode :character Median : 916.9 Median : 375.40 Median : 0.0
Mean : 1601.2 Mean : 1025.43 Mean : 994.2
3rd Qu.: 2105.2 3rd Qu.: 1145.98 3rd Qu.: 1132.4
Max. :19043.1 Max. :49039.57 Max. :47137.2
CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PURCHASES_FREQUENCY
Min. : 50 Min. : 0.05 Min. : 0.019 Min. :0.00000
1st Qu.: 1600 1st Qu.: 418.56 1st Qu.: 169.164 1st Qu.:0.08333
Median : 3000 Median : 896.68 Median : 312.452 Median :0.50000
Mean : 4522 Mean : 1784.48 Mean : 864.305 Mean :0.49600
3rd Qu.: 6500 3rd Qu.: 1951.14 3rd Qu.: 825.496 3rd Qu.:0.91667
Max. :30000 Max. :50721.48 Max. :76406.208 Max. :1.00000
PRC_FULL_PAYMENT TENURE debt_ratio payment_ratio
Min. :0.0000 Min. : 6.00 Min. : 0.00000 Min. :0.0009695
1st Qu.:0.0000 1st Qu.:12.00 1st Qu.: 0.04741 1st Qu.:0.3515823
Median :0.0000 Median :12.00 Median : 0.31826 Median :1.4828256
Mean :0.1593 Mean :11.53 Mean : 0.39772 Mean : Inf
3rd Qu.:0.1667 3rd Qu.:12.00 3rd Qu.: 0.72575 3rd Qu.:7.7078825
Max. :1.0000 Max. :12.00 Max. :15.90995 Max. : Inf
dim(cc_clean)
[1] 8636 12
After selecting relevant variables and removing missing values, the cleaned dataset contains 8,636 customers and 12 variables. Two additional variables were created using mutate(): debt_ratio, which measures balance relative to credit limit, and payment_ratio, which measures payments relative to balance. These derived variables help evaluate customer credit utilization and repayment behavior.
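The cleaning step described above can be sketched as follows. This is an illustrative base-R version on invented data (`toy` and `toy_clean` are hypothetical names); the project itself reports using mutate(), so this shows the same calculation rather than the exact code that was run.

```r
# Toy stand-in for the selected columns; all values are invented
toy <- data.frame(
  CUST_ID          = c("C1", "C2", "C3", "C4"),
  BALANCE          = c(500, 0, 2000, 1200),
  CREDIT_LIMIT     = c(1000, 3000, NA, 4000),
  PAYMENTS         = c(250, 100, 800, 600),
  MINIMUM_PAYMENTS = c(50, 20, 90, NA)
)

# Remove every row that has at least one missing value
toy_clean <- na.omit(toy)

# debt_ratio: balance relative to credit limit
toy_clean$debt_ratio <- toy_clean$BALANCE / toy_clean$CREDIT_LIMIT

# payment_ratio: payments relative to balance
# (a zero balance gives Inf here, which needs separate handling)
toy_clean$payment_ratio <- toy_clean$PAYMENTS / toy_clean$BALANCE

print(dim(toy_clean))
print(toy_clean)
```

In this toy example only customers C1 and C2 survive the missing-value filter, and C2 (zero balance) ends up with an infinite payment_ratio.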
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000e+00 3.500e-01 1.480e+00 1.958e+02 7.700e+00 1.186e+06 6
IMPORTANT OBSERVATION:
One of the newly created variables, payment_ratio, contains infinite values even though rows with NA values had already been removed. Because payment_ratio = PAYMENTS / BALANCE, any customer with BALANCE = 0 produces a division by zero, which R reports as Inf rather than as an error. If left in place, infinite values can distort plot axes, cause regression to fail or produce biased results, and badly distort the distances used in clustering. For customers with zero balance, the ratio was therefore replaced with a missing value (NA). After this correction, payment_ratio contains only finite values: 6 customers had undefined ratios due to zero balances and are recorded as NA.
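One possible way to make this correction is shown below. It is a sketch on invented vectors (`payments` and `balance` are hypothetical), and it treats every non-finite ratio as missing, which is an assumption about how the replacement was done.

```r
# Invented example values; two customers have a zero balance
payments <- c(250, 100, 0, 600)
balance  <- c(500, 0, 0, 1200)

# Division by zero yields Inf (100 / 0) or NaN (0 / 0) in R, not an error
payment_ratio <- payments / balance
print(payment_ratio)

# Replace every non-finite result (Inf, -Inf, NaN) with NA
payment_ratio[!is.finite(payment_ratio)] <- NA

# The summary now reports finite statistics plus a count of NA's
print(summary(payment_ratio))
```

Functions such as mean() then need na.rm = TRUE, and modeling functions will drop the NA rows according to their na.action setting.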
### Now we have the fully cleaned project dataset ###
summary(cc_clean)
CUST_ID BALANCE PURCHASES CASH_ADVANCE
Length:8636 Min. : 0.0 Min. : 0.00 Min. : 0.0
Class :character 1st Qu.: 148.1 1st Qu.: 43.37 1st Qu.: 0.0
Mode :character Median : 916.9 Median : 375.40 Median : 0.0
Mean : 1601.2 Mean : 1025.43 Mean : 994.2
3rd Qu.: 2105.2 3rd Qu.: 1145.98 3rd Qu.: 1132.4
Max. :19043.1 Max. :49039.57 Max. :47137.2
CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PURCHASES_FREQUENCY
Min. : 50 Min. : 0.05 Min. : 0.019 Min. :0.00000
1st Qu.: 1600 1st Qu.: 418.56 1st Qu.: 169.164 1st Qu.:0.08333
Median : 3000 Median : 896.68 Median : 312.452 Median :0.50000
Mean : 4522 Mean : 1784.48 Mean : 864.305 Mean :0.49600
3rd Qu.: 6500 3rd Qu.: 1951.14 3rd Qu.: 825.496 3rd Qu.:0.91667
Max. :30000 Max. :50721.48 Max. :76406.208 Max. :1.00000
PRC_FULL_PAYMENT TENURE debt_ratio payment_ratio
Min. :0.0000 Min. : 6.00 Min. : 0.00000 Min. :0.000e+00
1st Qu.:0.0000 1st Qu.:12.00 1st Qu.: 0.04741 1st Qu.:3.500e-01
Median :0.0000 Median :12.00 Median : 0.31826 Median :1.480e+00
Mean :0.1593 Mean :11.53 Mean : 0.39772 Mean :1.958e+02
3rd Qu.:0.1667 3rd Qu.:12.00 3rd Qu.: 0.72575 3rd Qu.:7.700e+00
Max. :1.0000 Max. :12.00 Max. :15.90995 Max. :1.186e+06
NA's :6
After data cleaning and transformation, the final working dataset (cc_clean) contains 8,636 customer observations and 12 variables and is used for all subsequent analyses.
Data Visualizations
(This helps us to see patterns in the data).
# Figure 1: Scatter Plot: Purchases vs Credit Card Balance
plot1 <- ggplot(data = cc_clean) +
  geom_point(aes(x = PURCHASES, y = BALANCE)) +
  labs(
    title = "Purchases vs Credit Card Balance",
    x = "Purchases",
    y = "Balance"
  ) +
  theme_bw()
plot1
Figure 1: Purchases vs Credit Card Balance
(This figure helps determine whether customer spending is associated with debt accumulation, which directly supports the project objective of identifying the key drivers of credit card debt).
The scatter plot shows the relationship between customer purchases and credit card balance. Most customers are concentrated at lower purchase levels and lower balances, indicating that many customers have moderate spending and relatively low debt. There are also several outliers with very high purchase amounts and high balances. In all, the relationship appears weakly positive, suggesting that higher purchases may contribute to higher credit card balances for some customers.
# Figure 2: Scatter Plot with Trend Line: Payments vs Balance
plot2 <- ggplot(data = cc_clean) +
  geom_point(aes(x = PAYMENTS, y = BALANCE)) +
  geom_smooth(aes(x = PAYMENTS, y = BALANCE)) +
  labs(
    title = "Payments vs Credit Card Balance",
    x = "Payments",
    y = "Balance"
  )
plot2
Figure 2: Payments vs Credit Card Balance
(This figure helps explain repayment behavior and shows that payment amount alone may not indicate low debt, since customers with high balances often make higher payments).
The scatter plot shows the relationship between customer payments and credit card balance. Most customers are concentrated at lower payment amounts and lower balances. The smooth trend line shows a positive relationship, indicating that customers with higher balances also tend to make higher payments. This means that larger payments are often associated with customers carrying more debt rather than automatically reducing balances.
# Figure 3: Scatter Plot with Trend Line: Cash Advance vs Balance
plot3 <- ggplot(data = cc_clean) +
  geom_point(aes(x = CASH_ADVANCE, y = BALANCE)) +
  geom_smooth(aes(x = CASH_ADVANCE, y = BALANCE)) +
  labs(
    title = "Cash Advance vs Credit Card Balance",
    x = "Cash Advance",
    y = "Balance"
  ) +
  theme_bw()
plot3
Figure 3: Cash Advance vs Credit Card Balance
(This figure helps identify risky borrowing behavior. Cash advances often carry extra fees and interest, so frequent or large withdrawals may contribute to higher debt).
The scatter plot shows the relationship between cash advance usage and credit card balance. Most customers have low cash advance amounts and lower balances. The smooth trend line slopes upward, indicating a positive relationship between cash advance and balance. Customers who withdraw larger cash advances tend to carry higher credit card debt.
# Figure 4: Histogram: Distribution of Credit Card Balance
plot4 <- ggplot(data = cc_clean) +
  geom_histogram(aes(x = BALANCE), bins = 30) +
  labs(
    title = "Distribution of Credit Card Balance",
    x = "Balance",
    y = "Count"
  )
plot4
Figure 4: Distribution of Credit Card Balance
(This figure shows that credit card debt is not evenly distributed. Most customers manage lower balances, while a smaller group may represent higher debt risk).
The histogram shows the distribution of customer credit card balances. Most customers have relatively low balances, with the highest concentration near zero to moderate debt levels. The distribution is strongly right-skewed, meaning a small number of customers carry very high balances.
# Figure 5: Histogram: Distribution of Purchases
plot5 <- ggplot(data = cc_clean) +
  geom_histogram(aes(x = PURCHASES), bins = 30) +
  labs(
    title = "Distribution of Customer Purchases",
    x = "Purchases",
    y = "Count"
  ) +
  theme_bw()
plot5
Figure 5: Distribution of Customer Purchases
(This figure helps identify customer spending patterns and suggests that a small group of high spenders may significantly influence overall purchase totals).
The histogram shows the distribution of customer purchases. Most customers have low to moderate purchase amounts, while a small number of customers make very large purchases. The distribution is strongly right-skewed, indicating that high spending is concentrated among a few customers.
(This figure helps evaluate whether the length of customer relationship is associated with debt accumulation and balance variability).
This boxplot compares balances by tenure. Customers with shorter tenure generally have lower balances, while customers with longer tenure show higher typical balances and a wider spread. This suggests that debt may increase over time for some customers.
# Figure 7: Bar Chart: Number of Customers by Tenure
plot7 <- ggplot(data = cc_clean) +
  geom_bar(aes(x = factor(TENURE), fill = factor(TENURE))) +
  labs(
    title = "Number of Customers by Tenure",
    x = "Tenure",
    y = "Count",
    fill = "Tenure"
  )
plot7
Figure 7: Number of Customers by Tenure
(This figure helps understand customer composition and explains why tenure 12 may strongly influence overall results).
The bar chart shows the number of customers in each tenure category. The largest proportion of customers have a tenure of 12 months, while much smaller numbers are observed in the shorter tenure groups. This indicates that most customers in the dataset are long-term customers.
# Figure 8: Scatter Plot: Credit Limit vs Balance
plot8 <- ggplot(data = cc_clean) +
  geom_point(aes(x = CREDIT_LIMIT, y = BALANCE)) +
  geom_smooth(aes(x = CREDIT_LIMIT, y = BALANCE)) +
  labs(
    title = "Credit Limit vs Credit Card Balance",
    x = "Credit Limit",
    y = "Balance"
  )
plot8
Figure 8: Credit Limit vs Credit Card Balance
(This figure helps evaluate whether access to larger credit limits contributes to higher debt balances and whether borrowing behavior changes at higher limit levels).
The scatter plot shows the relationship between credit limit and credit card balance. The trend line initially rises, indicating that customers with higher credit limits tend to carry higher balances. At very high credit limits, the relationship levels off and slightly declines. This shows that moderate increases in credit limit are associated with higher debt, but the highest-limit customers may manage balances more effectively.
# Figure 9: Scatter Plot: Debt Ratio vs Balance
plot9 <- ggplot(data = cc_clean) +
  geom_point(aes(x = debt_ratio, y = BALANCE)) +
  geom_smooth(aes(x = debt_ratio, y = BALANCE)) +
  labs(
    title = "Debt Ratio vs Credit Card Balance",
    x = "Debt Ratio",
    y = "Balance"
  )
plot9
Figure 9: Debt Ratio vs Credit Card Balance
(This figure helps evaluate credit utilization behavior and identifies customers who may be overextended relative to their available credit).
The scatter plot shows the relationship between debt ratio and credit card balance. Most customers are concentrated at lower debt ratios, indicating balances below their credit limits and moderate credit utilization. Customers with higher balances are mostly observed at lower to moderate debt ratios. A small number of customers with very high debt ratios appear as outliers.
(This figure gives a quick summary of the most important customer behaviors related to credit card debt and helps compare multiple drivers in one view).
The combined dashboard summarizes four major relationships with credit card balance. Purchases show a weak positive relationship with balance. Payments display a positive relationship, indicating customers with larger balances often make larger payments. Cash advance has a clear positive relationship with balance, suggesting borrowing cash is associated with higher debt. Credit limit also shows a positive relationship at moderate levels before flattening at higher limits. In summary, this dashboard shows four things:
Spending more can increase debt.
People with bigger debt often pay more.
Taking cash from the card is linked to more debt.
Bigger credit limits can lead to bigger balances.
Data Modeling (Regression)
(This helps to measure and predict relationships in the data).
A multiple linear regression model was fitted to predict credit card balance (BALANCE) using PURCHASES, CASH_ADVANCE, PAYMENTS, CREDIT_LIMIT, and TENURE.
The overall regression model is statistically significant (F-statistic p < 2.2e-16), indicating that the selected predictors jointly explain customer credit card balances. The model has an Adjusted R-squared of 0.4203, meaning approximately 42% of the variation in balances is explained by the included variables.
All predictors were statistically significant (p < 0.001). Holding the other variables constant, a one-unit increase in PURCHASES is associated with an increase of approximately 0.136 units in balance. A one-unit increase in CASH_ADVANCE is associated with an increase of approximately 0.446 units, the largest per-unit effect among the monetary predictors. A one-unit increase in PAYMENTS is associated with a decrease of approximately 0.106 units. A one-unit increase in CREDIT_LIMIT is associated with an increase of approximately 0.231 units. Each additional month of TENURE is associated with an increase of approximately 76.95 units, suggesting that longer-term customers tend to carry higher balances.
Hence, the results suggest that cash advance usage, credit limit, purchases, and longer tenure are important drivers of higher credit card debt, while higher payments are associated with lower balances.
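A model of this form can be fitted with lm(). The sketch below runs on simulated data so that it is self-contained: the data frame `sim`, the sample size, and the coefficients used to generate BALANCE are all invented for illustration and are not the project's results.

```r
set.seed(42)
n <- 200

# Simulated predictors; distributions and scales are illustrative assumptions
sim <- data.frame(
  PURCHASES    = rexp(n, rate = 1 / 1000),
  CASH_ADVANCE = rexp(n, rate = 1 / 900),
  PAYMENTS     = rexp(n, rate = 1 / 1700),
  CREDIT_LIMIT = runif(n, min = 500, max = 15000),
  TENURE       = sample(6:12, n, replace = TRUE)
)

# Simulated response built from made-up coefficients plus random noise
sim$BALANCE <- 0.15 * sim$PURCHASES + 0.40 * sim$CASH_ADVANCE -
  0.10 * sim$PAYMENTS + 0.20 * sim$CREDIT_LIMIT + 70 * sim$TENURE +
  rnorm(n, sd = 500)

# Multiple linear regression with the same five predictors named in the text
fit <- lm(BALANCE ~ PURCHASES + CASH_ADVANCE + PAYMENTS + CREDIT_LIMIT + TENURE,
          data = sim)

# Coefficient table, overall F-test, and R-squared values
print(summary(fit))

# Adjusted R-squared on its own
adj_r2 <- summary(fit)$adj.r.squared
print(adj_r2)
```

Each coefficient is read as the expected change in BALANCE for a one-unit change in that predictor with the others held fixed, which is the interpretation used in the paragraph above.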
Clustering Analysis
After removing the target (dependent) variable BALANCE, customers are clustered using only the independent financial behavior variables. This helps identify groups such as low-risk customers, high-spending customers, heavy borrowers, and disciplined payers. The clustering variables are PURCHASES (spending), CASH_ADVANCE (borrowing), PAYMENTS (repayment), CREDIT_LIMIT (available credit), and TENURE (customer duration).
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 2 proposed 3 as the best number of clusters
* 8 proposed 4 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 2 proposed 6 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
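The general workflow (standardize the variables, compare candidate numbers of clusters, then fit PAM) can be sketched as below. This is not the code that produced the output above: it uses invented data, a subset of the variables, and average silhouette width as a simpler, swapped-in criterion for choosing k instead of the multi-index majority rule. The helper `make_group` and the data frame `toy` are hypothetical.

```r
library(cluster)  # provides pam(); a recommended package bundled with R

set.seed(1)

# Hypothetical helper that simulates one group of customers
make_group <- function(n, purchases, cash, payments, limit) {
  data.frame(
    PURCHASES    = rnorm(n, mean = purchases, sd = 100),
    CASH_ADVANCE = rnorm(n, mean = cash,      sd = 100),
    PAYMENTS     = rnorm(n, mean = payments,  sd = 150),
    CREDIT_LIMIT = rnorm(n, mean = limit,     sd = 300)
  )
}

# Two invented groups: lower-activity and higher-activity customers
toy <- rbind(
  make_group(30, 500, 450, 900, 2400),
  make_group(30, 1900, 2000, 3400, 8400)
)

# Standardize so that no single variable dominates the distances
toy_scaled <- scale(toy)

# Average silhouette width for each candidate number of clusters
ks <- 2:5
avg_sil <- sapply(ks, function(k) pam(toy_scaled, k = k)$silinfo$avg.width)
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_silhouette = avg_sil))

# Fit PAM with the selected k and inspect cluster sizes and medoids
pam_fit <- pam(toy_scaled, k = best_k)
print(table(pam_fit$clustering))
print(pam_fit$medoids)
```

Because the medoids are reported on the standardized scale, values below zero mean below-average and values above zero mean above-average for that variable.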
Cluster 1
The PAM (Partitioning Around Medoids) values are generally below average for purchases, cash advance, payments, and credit limit. Cluster 1 contains customers with lower spending activity, lower borrowing behavior, smaller credit limits, and lower repayment amounts. In short, Cluster 1 comprises smaller users who spend less, borrow less, and have lower limits.
Cluster 2
The PAM (Partitioning Around Medoids) values are above average for purchases, cash advance, and especially credit limit. Cluster 2 contains more financially active customers with higher spending, greater access to credit, and stronger borrowing activity. In short, these are larger users who spend more, borrow more, and have higher limits.
The cluster summary table provides a clearer understanding of the two customer groups identified by the PAM clustering model.
Cluster 1: Lower Activity Customers
Customers in Cluster 1 have:
Average purchases of 532 currency units
Average cash advance of 442 currency units
Average payments of 906 currency units
Average credit limit of 2,381 currency units
Cluster 2: Higher Activity Customers
Customers in Cluster 2 have:
Average purchases of 1,917 currency units
Average cash advance of 1,992 currency units
Average payments of 3,372 currency units
Average credit limit of 8,392 currency units
This segmentation can help banks design different products, repayment strategies, and risk monitoring systems for each customer group.
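A cluster summary table of this kind can be produced by averaging each variable within each cluster. The sketch below is a minimal base-R illustration on invented numbers (`toy` and its values are hypothetical and are not the project's cluster assignments).

```r
# Invented customers with an already assigned cluster label
toy <- data.frame(
  cluster      = c(1, 1, 1, 2, 2, 2),
  PURCHASES    = c(400, 500, 600, 1800, 2000, 2200),
  CREDIT_LIMIT = c(2000, 2500, 3000, 8000, 8500, 9000)
)

# Mean of each variable within each cluster
cluster_means <- aggregate(
  cbind(PURCHASES, CREDIT_LIMIT) ~ cluster,
  data = toy,
  FUN  = mean
)
print(cluster_means)
```

Summarizing on the original (unscaled) variables keeps the averages in currency units, which makes the segments easier to describe.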
Interactive Figures
1. Purchases vs Balance
The interactive scatter plot shows the relationship between purchases and balance. Most customers have lower purchases and lower balances, while a smaller number of customers have very high purchases and balances. The interactive feature allows us to hover over and zoom in on specific observations and to explore outliers and crowded regions more clearly than in static charts.
## Figure 1: Purchases vs Balance ##
# install.packages("plotly")
library(plotly)

p1 <- ggplot(data = cc_clean) +
  geom_point(aes(x = PURCHASES, y = BALANCE)) +
  labs(
    title = "Interactive Purchases vs Balance",
    x = "Purchases",
    y = "Balance"
  )
ggplotly(p1)
2. Cash Advance vs Balance
The interactive scatter plot shows the relationship between cash advance and balance. Most customers have low cash advance usage, while a smaller group uses large cash advances and tends to have higher balances. The interactive view helps identify extreme borrowers more clearly.
## Figure 2: Cash Advance vs Balance ##
p2 <- ggplot(data = cc_clean) +
  geom_point(aes(x = CASH_ADVANCE, y = BALANCE)) +
  labs(
    title = "Interactive Cash Advance vs Balance",
    x = "Cash Advance",
    y = "Balance"
  )
ggplotly(p2)
The interactive table displays customer IDs, assigned cluster groups, and major financial variables such as purchases, cash advance, payments, and credit limit. This table allows us to easily sort columns, search customers, and browse multiple pages. This makes it easier to inspect customer segments in detail.
Conclusion
This project analyzed customer credit card behavior using the CC GENERAL dataset to identify the key drivers of credit card debt and segment customers into meaningful groups. The regression analysis showed that cash advance usage, purchases, credit limit, and tenure were positively associated with higher balances, while higher payments were associated with lower balances. The clustering analysis identified two main customer groups:
Cluster 1: Lower activity customers with smaller spending, borrowing, and credit limits.
Cluster 2: Higher activity customers with larger spending, borrowing, payments, and credit limits.
These findings show that debt behavior differs substantially across customers and that financial activity variables explain a meaningful share of the variation in balances (about 42% in the regression model).
Recommendations
For Banks / Financial Institutions
Monitor customers with high cash advance usage, as they are more likely to carry higher debt balances.
Offer tailored products for different customer clusters:
Standard cards for Cluster 1
Premium rewards cards for Cluster 2
Encourage larger and more consistent payments to reduce balances.
Use cluster segmentation for targeted marketing and risk management.
For Customers
Limit unnecessary cash advances due to their association with higher debt.
Make timely payments above the minimum payment when possible.
Use available credit responsibly to avoid excessive balances.
Acknowledgement
The findings presented in this project are exclusive to this course. They have not been presented in this or previous semesters, and will not be presented in any other course during this semester.