In the modern financial landscape, understanding customer behavior is critical for personalized marketing and risk management. This project applies unsupervised learning techniques to a dataset of approximately 9,000 credit card holders to identify distinct behavioral segments. By reducing the dimensionality of 17 behavioral variables, we aim to uncover the underlying patterns that define personas such as The Shopper, The Borrower, and The VIP.
The dataset used in this project comes from Kaggle. It contains 18 columns: a customer ID plus 17 behavioral attributes tracked over a six-month period. Key features include:
BALANCE: The amount left on the account to be paid.
PURCHASES: Total amount of purchases made.
CASH_ADVANCE: Cash in advance given by the user.
PURCHASES_FREQUENCY: How frequently purchases are being made (0 to 1).
TENURE: Length of the credit card service for the user.
To achieve high-quality segmentation, we utilize a multi-stage analytical pipeline:
PCA is a linear dimensionality reduction technique that transforms a large set of variables into a smaller one that still contains most of the original information.
The Goal: To simplify the dataset by identifying the underlying drivers of variance and removing redundant or highly correlated features.
In this Project: We reduced 17 behavioral variables into 5 Principal Components, allowing us to filter out noise while retaining 70.1% of the total variance.
t-SNE is a non-linear dimensionality reduction tool specifically designed for high-dimensional data visualization.
The Goal: To map complex, multidimensional data into a 2D space where similar points are placed close together in islands.
In this Project: It serves as a visual bridge to verify that the clusters identified by mathematical algorithms actually form distinct, logical groups in 2D space.
K-Means is a centroid-based algorithm that partitions data into \(K\) pre-defined, non-overlapping groups.
The Goal: To assign every customer to a specific segment based on their proximity to the cluster’s average behavior.
In this Project: It was applied to the PCA-reduced data to define our four primary business personas: Transactors, Revolvers, VIPs, and Inactive users.
DBSCAN is a density-based clustering algorithm that groups points that are closely packed together.
The Goal: Unlike K-Means, DBSCAN can find clusters of any shape and, most importantly, identifies outliers as noise.
In this Project: It was used for Anomaly Detection, separating the 1-2% of users with extreme financial behaviors (the noise) to ensure the main segments remained clean and accurate.
First, we load the required libraries.
# Load packages
library(tidyverse) # Data manipulation
library(factoextra) # Elegant PCA & Cluster visuals
library(FactoMineR) # PCA analysis
library(Rtsne) # Non-linear dimensionality reduction
library(dbscan) # Density-based clustering (outlier detection)
library(cluster) # Clustering algorithms
library(corrplot) # Correlation heatmaps
library(ggplot2) # Plotting
Next, we load the dataset and remove the ID column, which is not needed for the analysis.
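A minimal sketch of the loading step, under stated assumptions: the identifier column is assumed to be named `CUST_ID`, the helper `load_cards()` is hypothetical, and a tiny temporary CSV with made-up values stands in for the real file so the example runs on its own.

```r
# Hypothetical helper: read a CSV and drop the identifier column.
# `id_col = "CUST_ID"` is an assumed column name, not confirmed by the text.
load_cards <- function(path, id_col = "CUST_ID") {
  df <- read.csv(path, stringsAsFactors = FALSE)
  df[, setdiff(names(df), id_col), drop = FALSE]
}

# Toy stand-in file so the sketch runs without the real dataset.
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(CUST_ID   = c("C1", "C2", "C3"),
             BALANCE   = c(40.9, 3202.5, 2495.1),
             PURCHASES = c(95.4, 0, 773.2)),
  toy_path, row.names = FALSE
)

df_raw <- load_cards(toy_path)
str(df_raw)  # only the behavioral columns remain
```

With the real data, the only change would be passing the path of the downloaded file to `load_cards()`.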
Before proceeding with dimensionality reduction, it is essential to evaluate the distribution and quality of the raw data. The summary statistics below reveal two critical issues that must be addressed: extreme skewness and missing information.
| Statistic | Balance | Purchases | Cash Advance | Min Payments | Limit | Purchases Frequency |
|---|---|---|---|---|---|---|
| Minimum | 0.0 | 0.00 | 0.0 | 0.019 | 50 | 0.0000 |
| 1st Quartile | 128.3 | 39.63 | 0.0 | 169.124 | 1600 | 0.0833 |
| Median | 873.4 | 361.28 | 0.0 | 312.344 | 3000 | 0.5000 |
| Mean | 1564.5 | 1003.21 | 978.9 | 864.207 | 4494 | 0.4900 |
| 3rd Quartile | 2054.1 | 1110.13 | 1113.8 | 825.485 | 6500 | 0.9167 |
| Maximum | 19043.1 | 49039.57 | 47137.2 | 76406.208 | 30000 | 1.0000 |
| NAs | 0 | 0 | 0 | 313 | 1 | 0 |
To provide further evidence, the histogram of BALANCE shows that the dataset is indeed heavily right-skewed.
Figure 1. Histogram of Balance
Looking at the summary of the credit card attributes, we identify two main issues that are mentioned above:
Missing Values (NAs): The dataset contains 313 NA values in the MINIMUM_PAYMENTS column and 1 NA value in CREDIT_LIMIT.
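Data Scaling: The summary statistics also highlight a massive disparity in the scales of the variables: PURCHASES_FREQUENCY is bounded between 0 and 1, while CREDIT_LIMIT reaches 30,000 and MINIMUM_PAYMENTS exceeds 76,000.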
Missing values
df_raw$MINIMUM_PAYMENTS[is.na(df_raw$MINIMUM_PAYMENTS)] <- median(df_raw$MINIMUM_PAYMENTS, na.rm = TRUE)
df_raw$CREDIT_LIMIT[is.na(df_raw$CREDIT_LIMIT)] <- median(df_raw$CREDIT_LIMIT, na.rm = TRUE)
The NAs are filled with each column's median, which is less sensitive than the mean to the heavy right skew seen above.
Data Scaling
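Scaling is mandatory for PCA and t-SNE. Both methods rely on variances or distances, so unscaled monetary variables with huge variance (such as BALANCE or PURCHASES) would dominate frequency variables bounded between 0 and 1.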
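As a rough illustration of standardization, the sketch below applies base R's `scale()` to a toy data frame. The values are illustrative stand-ins, and the names `toy` and `toy_scaled` are assumptions rather than objects from the original analysis.

```r
# Toy stand-in for a few columns of the credit card data (illustrative values).
toy <- data.frame(
  BALANCE             = c(128.3, 873.4, 2054.1, 19043.1),
  CREDIT_LIMIT        = c(1600, 3000, 6500, 30000),
  PURCHASES_FREQUENCY = c(0.0833, 0.5, 0.9167, 1)
)

# Center each column to mean 0 and rescale it to standard deviation 1,
# so that no variable dominates purely because of its units.
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(toy_scaled), 10)  # approximately 0 for every column
apply(toy_scaled, 2, sd)         # 1 for every column
```

After this step, a dollar amount and a 0-to-1 frequency contribute on comparable terms to any variance or distance calculation.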
PCA helps us understand the primary axes of customer behavior. The objective is to reduce 17 variables into key behavioral drivers.
To see how many components to keep, a Scree Plot was generated.
Figure 2. Scree Plot
The results indicate that the first two dimensions account for nearly 48% of the total variance. Following the Elbow Method, I selected the first 5 dimensions for subsequent clustering (K-Means and DBSCAN), as they capture over 70% of the dataset’s variance while filtering out much of the noise.
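The component-selection logic can be sketched with base R's `prcomp()`, used here in place of FactoMineR to keep the example dependency-free. The built-in `mtcars` data is only a stand-in for the scaled credit card variables, and the names `cum_var`, `n_keep`, and `pca_scores` are assumptions.

```r
# Built-in `mtcars` is only a stand-in for the credit card data.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches the target.
# The 0.70 target mirrors the roughly 70% figure quoted in the text.
n_keep <- which(cum_var >= 0.70)[1]

# Scores on the retained components: the input for the clustering steps.
pca_scores <- pca$x[, seq_len(n_keep), drop = FALSE]

round(cum_var, 3)
n_keep
```

A cumulative-variance threshold is one common heuristic; in practice it is usually read alongside the elbow of the scree plot, which `screeplot(pca, type = "lines")` draws in base R.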
What drives the variances?
This visualization identifies which raw variables contribute most to our new dimensions.
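One way to put numbers on these contributions is to look at the squared loadings of each component. The sketch below again uses base R's `prcomp()` on the built-in `mtcars` data as a stand-in; the percentage definition (squared loading times 100) is a common convention and is an assumption about how the factor map was produced.

```r
# Built-in `mtcars` is only a stand-in for the credit card data.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix is a unit-length vector, so the squared
# loadings of one component sum to 1 and can be read as percentage shares.
loadings <- pca$rotation[, 1:2]
contrib_pct <- 100 * loadings^2

# Variables ranked by their contribution to the first component.
head(sort(contrib_pct[, "PC1"], decreasing = TRUE), 5)
```

Variables with a large share on the first component and a small share on the second correspond to arrows lying close to the horizontal axis of the factor map.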
Figure 3. Variables Factor Map
Dimension 1 (27.3% of Variance): The Spending Power
This horizontal axis is driven by how much and how often a customer actually uses the card for shopping.
Key Drivers
What it tells us?
Dimension 2 (20.3% of Variance): The Debt & Cash Intensity
This vertical axis is driven by how a customer manages their balance and if they use the card as an ATM (Cash Advances).
Key Drivers
What it tells us?
Variables Correlations
The angles between the arrows tell a story about customer behavior:
By combining these two dimensions, we can see four potential segments:
Top-Right: High spenders who also carry balances (The VIB - Very Important Borrowers).
Bottom-Right: High-frequency shoppers who pay off installments (The Smart Shoppers).
Top-Left: People who rarely shop but use the card for cash (The Emergency Users).
Center/Bottom-Left: Inactive or low-usage accounts.
While PCA provided a solid linear foundation for dimensionality reduction, I implemented t-SNE to capture the non-linear manifold of the customer data. By prioritizing local neighborhood structures over global variance, t-SNE allows us to visualize distinct behavioral islands that PCA might obscure. This step is crucial for verifying that the clusters we find later (using K-Means or DBSCAN) are naturally separated and not just mathematical artifacts.
The initial projection reveals a diverse landscape of customer groups, indicating that the dataset contains strong underlying structures rather than a uniform distribution.
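The projection itself might be produced along the following lines with the `Rtsne` package. This is a sketch under assumptions: the input is random stand-in data, the perplexity value is a guess (the text does not state it), and whether t-SNE was run on the scaled variables or on the principal components is not specified. Only the column names `tsne1` and `tsne2` are taken from the plotting code.

```r
library(Rtsne)

set.seed(42)  # t-SNE is stochastic; fixing the seed gives a repeatable layout

# Random stand-in for the scaled feature matrix (200 customers x 5 features).
x <- matrix(rnorm(200 * 5), ncol = 5)

tsne_res <- Rtsne(
  x,
  dims = 2,
  perplexity = 30,           # assumed value; not stated in the text
  pca = FALSE,               # skip Rtsne's internal PCA preprocessing
  check_duplicates = FALSE   # the random stand-in has no exact duplicates
)

# Same column names as the plotting code expects.
tsne_plot <- data.frame(tsne1 = tsne_res$Y[, 1], tsne2 = tsne_res$Y[, 2])
head(tsne_plot)
```

Because the layout depends on the seed and the perplexity, it is worth rerunning with a few settings before reading much into the shape of any single island.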
# Visualizing the natural groupings found by t-SNE
ggplot(tsne_plot, aes(x = tsne1, y = tsne2)) +
geom_point(color = "skyblue", alpha = 0.5) +
theme_minimal() +
labs(title = "t-SNE Visualization of Customer Neighborhoods",
subtitle = "Identifying natural groupings in high-dimensional financial data")
Figure 4. t-SNE Visualization of Customer Neighborhoods
Observation
The visualization displays distinct islands and dense regions, confirming that credit card users form specific behavioral clusters.
Neighborhood Consistency
Points appearing close together represent users with nearly identical spending and borrowing habits in the original high-dimensional space.
To interpret the financial logic of these clusters, we overlaid the raw intensity of the Cash Advance variable onto the t-SNE coordinates.
# Overlaying raw Cash Advance data to interpret the clusters
tsne_plot$Cash_Advance <- df_raw$CASH_ADVANCE
# Generating the behavioral heatmap using a log scale
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = log1p(Cash_Advance))) +
geom_point(alpha = 0.6) +
scale_color_viridis_c(option = "magma") +
theme_minimal() +
labs(title = "Behavioral Heatmap: Cash Advance Intensity",
subtitle = "Overlaying raw features on t-SNE coordinates",
color = "Log(Amount)")
Figure 5. Behavioral Heatmap: Cash Advance Intensity
Observation
The bright magma regions (high intensity) align closely with specific islands, indicating regions dominated by heavy borrowers.
Validation
This suggests that our dimensionality reduction preserved critical business features like cash-borrowing intensity.
Finally, we projected the K-Means labels found in the 5D Principal Component space back onto the 2D t-SNE manifold to verify our segmentation.
# Color by cluster to verify spatial separation
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = cluster)) +
geom_point(size = 1.5, alpha = 0.7) +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "t-SNE Map colored by K-Means Clusters",
subtitle = "Verification of cluster separation in 2D space")
Figure 6. t-SNE Map colored by K-Means Clusters
Conclusion
The clear spatial separation of colors into specific islands validates that the K-Means algorithm successfully captured the same behavioral structures visualized by t-SNE. This provides a high level of confidence in our final four-cluster model.
Next, we apply our clustering algorithms to the reduced data.
To determine the ideal number of segments, the Average Silhouette Method was applied. This metric measures how well each point fits into its assigned cluster compared to others.
Figure 7. Optimal Number of Clusters
Observation: The silhouette score reaches its global maximum at k = 4.
Result: A 4-cluster solution was selected as it provides the most mathematically distinct separation of customer behaviors.
Execution: K-Means was performed on the top 5 Principal Components using 25 random starts to ensure the stability of the centroids.
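The selection and fitting steps above can be sketched as follows. The data are four synthetic, well-separated blobs standing in for the PCA scores, and the candidate range `2:8` is an assumption. `nstart = 25` follows the text, and `km_res` matches the name used in the later plotting code.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Synthetic stand-in for the PCA scores: four well-separated blobs in 2D.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
scores <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

d <- dist(scores)  # pairwise distances, reused for every candidate k

# Average silhouette width for each candidate number of clusters.
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(scores, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]

# Final fit with 25 random starts, as described in the text.
km_res <- kmeans(scores, centers = best_k, nstart = 25)
table(km_res$cluster)
```

Multiple random starts matter because a single K-Means run can settle in a poor local optimum; `kmeans()` keeps the start with the lowest total within-cluster sum of squares.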
Financial datasets often contain Whales or irregular users whose extreme spending habits can distort traditional clusters. Unlike K-Means, DBSCAN identifies these as outliers.
Figure 8. K-NN Distance Plot
Using a K-NN Distance Plot, we identify the knee where the distance suddenly increases. A threshold of Epsilon (eps) = 1.5 was chosen to separate core points from noise.
Objective: To label users that do not belong to any dense behavioral neighborhood as “Outliers,” effectively isolating anomalies for separate business review.
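A small sketch of this step with the `dbscan` package, under assumptions: the data are one synthetic dense blob plus five planted isolated points, and `minPts = 5` is a guess because the text does not state the value used. Only `eps = 1.5` and the name `db_res` come from the document.

```r
library(dbscan)

set.seed(7)

# Synthetic stand-in: one dense blob plus five planted, isolated points.
blob <- matrix(rnorm(200 * 2), ncol = 2)
planted <- rbind(c(20, 20), c(-20, 15), c(25, -18), c(-22, -22), c(0, 30))
x <- rbind(blob, planted)

min_pts <- 5  # assumed; the text does not state the minPts used

# Sorted distance to the (min_pts - 1)-th nearest neighbour. The knee of this
# curve suggests eps; kNNdistplot(x, k = min_pts - 1) draws the same curve.
knn_d <- sort(kNNdist(x, k = min_pts - 1))

db_res <- dbscan(x, eps = 1.5, minPts = min_pts)

# Cluster 0 is DBSCAN's label for noise points.
is_outlier <- db_res$cluster == 0
mean(is_outlier)  # share of points flagged as noise
```

The planted points have no neighbours within `eps`, so they can neither be core points nor be reached from one, which is exactly what makes them noise.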
To focus purely on the density of the customer base, we can visualize the dataset using only the DBSCAN results.
# Plotting DBSCAN results only to highlight density and noise
tsne_plot$dbscan_cluster <- as.factor(db_res$cluster)
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = dbscan_cluster)) +
geom_point(alpha = 0.5) +
scale_color_viridis_d() +
theme_minimal() +
labs(title = "DBSCAN Density-Based Map",
subtitle = "Cluster 0 represents detected behavioral noise/outliers",
color = "DBSCAN Cluster")
Figure 9. DBSCAN Density-Based Map
This view ignores pre-defined cluster counts and shows only the natural dense populations. Cluster 0 (typically the darkest color) highlights the irregular users who are the most different from the rest of the customer base.
The final step projects both the K-Means clusters (colors) and the DBSCAN outliers (shapes) onto the t-SNE manifold for visual verification.
# Add cluster labels to t-SNE coordinates for a map
tsne_plot$cluster_km <- as.factor(km_res$cluster)
tsne_plot$is_outlier <- as.factor(ifelse(db_res$cluster == 0, "Outlier", "Normal"))
# Plotting K-means results on a t-SNE manifold
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = cluster_km, shape = is_outlier)) +
geom_point(alpha = 0.6) +
theme_minimal() +
labs(title = "t-SNE Projection of Credit Card Segments",
subtitle = "Colors = K-Means Clusters | Shapes = DBSCAN Outlier Detection",
x = "t-SNE Dimension 1", y = "t-SNE Dimension 2")
Figure 10. t-SNE Projection of Credit Card Segments
Interpretation: The spatial separation of the 4 colors confirms that our segments occupy distinct islands in the behavioral space.
Noise Identification: The triangle shapes represent the outliers discovered by DBSCAN. Notice how these are often located at the edges of the manifold or in less dense areas, which is consistent with unique, non-patterned behaviors.
The integration of PCA, t-SNE, K-Means, and DBSCAN has allowed us to transform 17 complex variables into four clearly defined customer personas. This segmentation provides a strategic roadmap for personalized marketing and risk management.
After identifying 4 distinct clusters and filtering noise with DBSCAN, we calculated the average behavioral metrics to define our final business personas.
| Cluster | Persona | Count | Balance | Purchases | Cash Adv | Freq. | Limit |
|---|---|---|---|---|---|---|---|
| 1 | The Inactive | 3986 | 1003 | 279 | 565 | 0.178 | 3351 |
| 2 | The VIP Shoppers | 461 | 3408 | 7252 | 691 | 0.948 | 9784 |
| 3 | The Active Transactors | 3265 | 890 | 1210 | 210 | 0.889 | 4076 |
| 4 | The Revolving Borrowers | 1238 | 4465 | 463 | 4447 | 0.273 | 7309 |
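A per-cluster profile like the one in the table above can be computed with base R's `aggregate()`. The sketch below uses a toy data frame with made-up values and assumed cluster labels; the names `toy` and `profile` are hypothetical.

```r
# Toy stand-in: a few customers with assumed K-Means labels (made-up values).
toy <- data.frame(
  cluster   = c(1, 1, 2, 2, 3, 3),
  BALANCE   = c(1000, 1200, 3000, 3800, 900, 880),
  PURCHASES = c(200, 360, 7000, 7500, 1200, 1220)
)

# Mean of each behavioral metric within each cluster.
profile <- aggregate(cbind(BALANCE, PURCHASES) ~ cluster, data = toy, FUN = mean)

# Number of customers per cluster, in the same (sorted) cluster order.
profile$Count <- as.vector(table(toy$cluster))

profile
```

Reading the averages side by side is what turns anonymous cluster numbers into personas: the label follows from the profile, not the other way round.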
| Cluster | Persona Name | Primary Financial Behavior | Strategy Focus |
|---|---|---|---|
| Cluster 1 | The Inactive | Low usage across all metrics; potentially “sleeping” accounts. | Re-engagement |
| Cluster 2 | The VIPs | High credit limits and extreme spending volume. | Premium Services |
| Cluster 3 | The Transactors | High purchase frequency with low balances. They pay in full. | Loyalty & Rewards |
| Cluster 4 | The Revolvers | High balance and heavy reliance on Cash Advances. | Debt Management |
By utilizing DBSCAN, we successfully filtered out anomalous noise (The Outliers), ensuring that the personas described above are stable and representative of the majority of the customer base. The use of t-SNE provided a visual validation that these segments are not just mathematical artifacts but real, distinct neighborhoods of financial behavior.