1. Introduction

In the modern financial landscape, understanding customer behavior is critical for personalized marketing and risk management. This project applies unsupervised learning techniques to a dataset of approximately 9,000 credit card holders to identify distinct behavioral segments. By reducing the dimensionality of the 17 behavioral variables, we aim to uncover the underlying patterns that define personas such as The Shopper, The Borrower, and The VIP.

2. Review of the dataset

The dataset used in this project comes from Kaggle. It contains 18 attributes per customer (a customer ID plus 17 behavioral variables) summarizing activity over a six-month period. Key features include:

  • BALANCE: The amount left on the account to be paid.

  • PURCHASES: Total amount of purchases made.

  • CASH_ADVANCE: Total cash-advance amount taken by the user.

  • PURCHASES_FREQUENCY: How frequently purchases are being made (0 to 1).

  • TENURE: Length of the credit card service for the user.

3. Methods Overview

To achieve high-quality segmentation, we utilize a multi-stage analytical pipeline:

3.1 Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique that transforms a large set of variables into a smaller one that still contains most of the original information.

The Goal: To simplify the dataset by identifying the underlying drivers of variance and removing redundant or highly correlated features.

In this Project: We reduced 17 behavioral variables into 5 Principal Components, allowing us to filter out noise while retaining 70.1% of the original information.

3.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction tool specifically designed for high-dimensional data visualization.

The Goal: To map complex, multidimensional data into a 2D space where similar points are placed close together in islands.

In this Project: It serves as a visual bridge to verify that the clusters identified by mathematical algorithms actually form distinct, logical groups in 2D space.

3.3 K-Means Clustering

K-Means is a centroid-based algorithm that partitions data into K pre-defined, non-overlapping groups.

The Goal: To assign every customer to a specific segment based on their proximity to the cluster’s average behavior.

In this Project: It was applied to the PCA-reduced data to define our four primary business personas: Transactors, Revolvers, VIPs, and Inactive users.

3.4 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups points that are closely packed together.

The Goal: Unlike K-Means, DBSCAN can find clusters of any shape and, most importantly, identifies outliers as noise.

In this Project: It was used for Anomaly Detection, separating the 1-2% of users with extreme financial behaviors (the noise) to ensure the main segments remained clean and accurate.

4. Data Processing

4.1 Libraries and dataset

First, we load the required libraries.

# Load packages
library(tidyverse)   # Data manipulation
library(factoextra)  # Elegant PCA & Cluster visuals
library(FactoMineR)  # PCA analysis
library(Rtsne)       # Non-linear dimensionality reduction
library(dbscan)      # Density-based clustering (outlier detection)
library(cluster)     # Clustering algorithms
library(corrplot)    # Correlation heatmaps
library(ggplot2)

Next, we load the dataset and drop the CUST_ID column, which is only an identifier and carries no behavioral information.

# Load data
df <- read.csv("CCGENERAL.csv")
df_raw <- df %>%
  select(-CUST_ID)

4.2 Statistics summary

Before proceeding with dimensionality reduction, it is essential to evaluate the distribution and quality of the raw data. The summary statistics below reveal two critical issues that must be addressed: extreme skewness and missing information.

summary(df_raw)
colSums(is.na(df_raw))

Table 1: Descriptive Statistics of the Credit Card Dataset
Financial Behavioral Metrics ($; Purchases Frequency is a 0 to 1 score)

Statistic       Balance  Purchases  Cash Advance  Min Payments  Credit Limit  Purchases Frequency
Minimum             0.0       0.00           0.0         0.019            50               0.0000
1st Quartile      128.3      39.63           0.0       169.124          1600               0.0833
Median            873.4     361.28           0.0       312.344          3000               0.5000
Mean             1564.5    1003.21         978.9       864.207          4494               0.4900
3rd Quartile     2054.1    1110.13        1113.8       825.485          6500               0.9167
Maximum         19043.1   49039.57       47137.2     76406.208         30000               1.0000
NAs                   0          0             0           313             1                    0

To provide further evidence, the histogram of BALANCE shows that the dataset is indeed heavily right-skewed.

Figure 1. Histogram of Balance
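The code behind Figure 1 is not shown, so the following is only a minimal sketch of how such a histogram could be produced. The data frame `toy` is a hypothetical log-normal stand-in for `df_raw`, and the bin count and colors are illustrative choices rather than the settings used for the figure.

```r
library(ggplot2)

# Hypothetical stand-in for df_raw: a log-normal sample mimics a right-skewed balance
set.seed(42)
toy <- data.frame(BALANCE = rlnorm(1000, meanlog = 6.5, sdlog = 1.2))

# A quick numeric check of right skew: the mean sits above the median
skew_gap <- mean(toy$BALANCE) - median(toy$BALANCE)

# Histogram of the balance; the long tail extends to the right
p_balance <- ggplot(toy, aes(x = BALANCE)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "white") +
  theme_minimal() +
  labs(title = "Histogram of Balance",
       x = "Balance ($)", y = "Number of customers")
```

On the real data, `toy` would be replaced by `df_raw`; a positive `skew_gap` is consistent with the mean (1564.5) exceeding the median (873.4) in Table 1.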

The summary of the credit card attributes confirms the two issues noted above:

Missing Values (NAs): The dataset contains 313 NA values in the MINIMUM_PAYMENTS column and 1 NA value in CREDIT_LIMIT.

    Justification for Median: In financial datasets, the mean is often heavily distorted by extreme outliers (wealthy whales). For example, while the mean of MINIMUM_PAYMENTS is 864.2, the median is much lower at 312.3, and the maximum reaches a staggering 76,406.2.
    Conclusion: To ensure the imputed values are representative of the typical customer rather than being pulled upward by extreme cases, median imputation was chosen over mean imputation.

Data Scaling: The statistics summary also highlights a massive disparity in the scales of the variables.

    Feature Ranges: Variables like BALANCE range from 0 to 19,043, whereas frequency-based variables like PURCHASES_FREQUENCY only range from 0 to 1.
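    Impact on PCA/Clustering: PCA and distance-based clustering are driven by variance and Euclidean distance. Without scaling, BALANCE would dominate the result simply because its values are thousands of times larger than those of PURCHASES_FREQUENCY, not because it is more informative.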
    Conclusion: To prevent high-magnitude features from dominating the model, all data was standardized (scaled) to ensure each variable contributes equally to the dimension reduction process.
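The effect of scaling on PCA can be illustrated with a small, self-contained sketch. The two columns below are hypothetical stand-ins that only mimic the scales of BALANCE and PURCHASES_FREQUENCY, and base R's `prcomp` is used here for brevity instead of the `PCA` function applied later in the report.

```r
# Hypothetical stand-in data: one dollar-scale variable, one 0-to-1 variable
set.seed(1)
n <- 500
toy <- data.frame(
  BALANCE = rexp(n, rate = 1 / 1500),       # spread in the thousands of dollars
  PURCHASES_FREQUENCY = runif(n, 0, 1)      # bounded between 0 and 1
)

# PCA on raw values versus PCA on standardized values
pca_unscaled <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(toy, center = TRUE, scale. = TRUE)

# Absolute loadings of each variable on the first component
load_unscaled <- abs(pca_unscaled$rotation[, 1])
load_scaled   <- abs(pca_scaled$rotation[, 1])

print(round(load_unscaled, 4))
print(round(load_scaled, 4))
```

In the unscaled fit the first component is essentially the BALANCE axis, while after standardization both variables load with equal weight.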

4.3 Preparation

Missing values

df_raw$MINIMUM_PAYMENTS[is.na(df_raw$MINIMUM_PAYMENTS)] <- median(df_raw$MINIMUM_PAYMENTS, na.rm = TRUE)
df_raw$CREDIT_LIMIT[is.na(df_raw$CREDIT_LIMIT)] <- median(df_raw$CREDIT_LIMIT, na.rm = TRUE)

As mentioned above, medians are used to fill the NAs.

Data Scaling

As mentioned above, scaling is required before PCA and t-SNE: both are sensitive to the units of the inputs, and the monetary variables have far larger variance than the frequency variables.

df_scaled <- scale(df_raw)

5. Dimension Reduction PCA

PCA helps us understand the primary axes of customer behavior. The objective is to reduce 17 variables into key behavioral drivers.

pca_res <- PCA(df_scaled, graph = FALSE)

5.1 Scree Plot

To see how many components to keep, a Scree Plot was generated.

Figure 2. Scree Plot

The results indicate that the first two dimensions account for nearly 48% of the total variance. Following the Elbow Method, I selected the first 5 dimensions for subsequent clustering (K-Means and DBSCAN), as they capture over 70% of the dataset’s variance while discarding the low-variance directions that mostly carry noise.
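The choice of the number of components can also be checked numerically from the eigenvalue table. The sketch below assumes the same FactoMineR and factoextra functions loaded earlier; `toy` is a hypothetical stand-in for `df_scaled`, and the 70% threshold mirrors the figure quoted in the text.

```r
library(FactoMineR)
library(factoextra)

# Hypothetical stand-in for df_scaled: six variables driven by two latent factors
set.seed(7)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a = f1 + rnorm(n, sd = 0.3),
  b = f1 + rnorm(n, sd = 0.3),
  c = f1 + rnorm(n, sd = 0.3),
  d = f2 + rnorm(n, sd = 0.3),
  e = f2 + rnorm(n, sd = 0.3),
  g = rnorm(n)
)

pca_toy <- PCA(toy, ncp = 5, graph = FALSE)

# The third column of $eig holds the cumulative percentage of variance
cum_var <- pca_toy$eig[, 3]

# Smallest number of components that reaches at least 70% of the variance
n_keep <- which(cum_var >= 70)[1]

# Scree plot with the percentage printed on each bar
scree <- fviz_eig(pca_toy, addlabels = TRUE)
```

Applied to `pca_res`, the same lookup on `pca_res$eig` would show where the cumulative percentage crosses the chosen threshold.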

5.2 Variable Contributions

What drives the variances?

This visualization identifies which raw variables contribute most to our new dimensions.

Figure 3. Variables Factor Map

Dimension 1 (27.3% of Variance): The Spending Power

This horizontal axis is driven by how much and how often a customer actually uses the card for shopping.

Key Drivers

  • PURCHASES
  • ONEOFF_PURCHASES
  • PURCHASES_TRX
  • PURCHASES_FREQUENCY

What does it tell us?

  • Customers on the far right are active shoppers who use their cards for transactions.
  • Customers on the left are likely inactive or use the card for purposes other than direct shopping.

Dimension 2 (20.3% of Variance): The Debt & Cash Intensity

This vertical axis is driven by how a customer manages their balance and if they use the card as an ATM (Cash Advances).

Key Drivers

  • CASH_ADVANCE
  • CASH_ADVANCE_FREQUENCY
  • CASH_ADVANCE_TRX
  • BALANCE

What does it tell us?

  • Customers at the top of the chart are Cash-Heavy users who often take cash out and carry high balances.
  • Customers at the bottom likely focus more on installment plans or pay off their purchases more regularly without relying on cash advances.

Variables Correlations

The angles between the arrows tell a story about customer behavior:

  • High Correlation (Close Arrows): CASH_ADVANCE and BALANCE point in the same direction. This suggests that people who take out cash advances are also the ones carrying the highest debt balances.
  • Negative Correlation (Opposite Directions): Notice how PURCHASES_FREQUENCY (bottom right) is somewhat opposite to CASH_ADVANCE_FREQUENCY (top left). This shows two distinct types of card users: The Shopper vs. The Cash Borrower.
  • Weak Correlation (90-degree Angle): TENURE (length of time as a customer) is nearly at a right angle to the spending variables, meaning a customer’s loyalty doesn’t necessarily predict if they are a high spender or a cash borrower.
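The reading of arrow angles as correlations can be checked with a short sketch. It uses base R's `prcomp` on hypothetical toy variables (`a`, `b`, `c`, `d` are illustrative names, not columns of the dataset). The identity is exact only when all components are kept; on a two-dimensional factor map it holds approximately, and best for variables that are well represented in the plane.

```r
# Hypothetical toy data: b moves with a, c moves against a, d is unrelated
set.seed(11)
n <- 400
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.4),
  c = -a + rnorm(n, sd = 0.4),
  d = rnorm(n)
)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable arrows: eigenvectors scaled by the component standard deviations
arrows_full <- pca_toy$rotation %*% diag(pca_toy$sdev)
rownames(arrows_full) <- colnames(toy)

# With all components kept, every arrow has length 1, so the inner product of
# two arrows is the cosine of their angle, which equals their correlation
cos_angles <- tcrossprod(arrows_full)
max_gap <- max(abs(cos_angles - cor(toy)))

print(round(cos_angles, 2))
```

Arrows pointing the same way give a cosine near 1, opposite arrows give a cosine near -1, and perpendicular arrows give a cosine near 0.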

5.3 Business Insights

By combining these two dimensions, we can see four potential segments:

  1. Top-Right: High spenders who also carry balances (The VIB - Very Important Borrowers).

  2. Bottom-Right: High-frequency shoppers who pay off installments (The Smart Shoppers).

  3. Top-Left: People who rarely shop but use the card for cash (The Emergency Users).

  4. Center/Bottom-Left: Inactive or low-usage accounts.

6. Visualization (t-SNE)

While PCA provided a solid linear foundation for dimensionality reduction, I implemented t-SNE to capture the non-linear manifold of the customer data. By prioritizing local neighborhood structures over global variance, t-SNE allows us to visualize distinct behavioral islands that PCA might obscure. This step is crucial for verifying that the clusters we find later (using K-Means or DBSCAN) are naturally separated and not just mathematical artifacts.

6.1 t-SNE Manifold Visualization

The initial projection reveals a diverse landscape of customer groups, indicating that the dataset contains strong underlying structures rather than a uniform distribution.
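The plotting code in this section relies on a data frame `tsne_plot` with columns `tsne1` and `tsne2`, whose construction is not shown. A minimal sketch of how such a frame could be built with `Rtsne` follows. It assumes the embedding is computed on the five retained principal components; `pc_scores` is a hypothetical stand-in for those coordinates, and the perplexity and seed are illustrative choices, not values taken from the analysis.

```r
library(Rtsne)

set.seed(123)

# Hypothetical stand-in for the five retained PCA coordinates: two groups in 5D
pc_scores <- rbind(
  matrix(rnorm(150 * 5, mean = 0), ncol = 5),
  matrix(rnorm(150 * 5, mean = 6), ncol = 5)
)

# 2D embedding; pca = FALSE because the input is already PCA-reduced
tsne_res <- Rtsne(pc_scores, dims = 2, perplexity = 30,
                  pca = FALSE, check_duplicates = FALSE)

# Collect the coordinates under the column names used by the plotting code
tsne_plot <- data.frame(tsne1 = tsne_res$Y[, 1],
                        tsne2 = tsne_res$Y[, 2])
```

Because t-SNE is stochastic, fixing the seed keeps the layout reproducible between runs; the island shapes can still change noticeably with the perplexity.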

# Visualizing the natural groupings found by t-SNE
ggplot(tsne_plot, aes(x = tsne1, y = tsne2)) +
  geom_point(color = "skyblue", alpha = 0.5) +
  theme_minimal() +
  labs(title = "t-SNE Visualization of Customer Neighborhoods",
       subtitle = "Identifying natural groupings in high-dimensional financial data")

Figure 4. t-SNE Visualization of Customer Neighborhoods

Observation

The visualization displays distinct islands and dense regions, confirming that credit card users form specific behavioral clusters.

Neighborhood Consistency

Points appearing close together represent users with nearly identical spending and borrowing habits in the original high-dimensional space.

6.2 Behavioral Heatmap: Cash Advance Intensity

To interpret the financial logic of these clusters, we overlaid the raw intensity of the Cash Advance variable onto the t-SNE coordinates.

# Overlaying raw Cash Advance data to interpret the clusters
tsne_plot$Cash_Advance <- df_raw$CASH_ADVANCE

# Generating the behavioral heatmap using a log scale
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = log1p(Cash_Advance))) + 
  geom_point(alpha = 0.6) +
  scale_color_viridis_c(option = "magma") + 
  theme_minimal() +
  labs(title = "Behavioral Heatmap: Cash Advance Intensity",
       subtitle = "Overlaying raw features on t-SNE coordinates",
       color = "Log(Amount)")

Figure 5. Behavioral Heatmap: Cash Advance Intensity

Observation

The bright regions of the magma scale (high cash-advance amounts) concentrate on specific islands, indicating regions dominated by heavy borrowers.

Validation

This indicates that our dimensionality reduction preserved critical business features like cash-borrowing intensity.

6.3 Cluster Validation (K-Means + t-SNE)

Finally, we projected the K-Means labels found in the 5D Principal Component space back onto the 2D t-SNE manifold to verify our segmentation.

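# Attach the K-Means labels (km_res is the K-Means fit from Section 7.1) as a factor
tsne_plot$cluster <- as.factor(km_res$cluster)

# Color by cluster to verify spatial separation
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = cluster)) +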
  geom_point(size = 1.5, alpha = 0.7) +
  scale_color_brewer(palette = "Set1") + 
  theme_minimal() +
  labs(title = "t-SNE Map colored by K-Means Clusters",
       subtitle = "Verification of cluster separation in 2D space")

Figure 6. t-SNE Map colored by K-Means Clusters

Conclusion

The clear spatial separation of colors into specific islands validates that the K-Means algorithm successfully captured the same behavioral structures visualized by t-SNE. This provides a high level of confidence in our final four-cluster model.

7. Clustering Results

Finally, we apply our clustering algorithms to the reduced data.

7.1 Finding the Optimal Number of Clusters (K-Means)

To determine the ideal number of segments, the Average Silhouette Method was applied. This metric measures how well each point fits into its assigned cluster compared to others.

Figure 7. Optimal Number of Clusters

Observation: The silhouette score reaches its global maximum at k = 4.

Result: A 4-cluster solution was selected as it provides the most mathematically distinct separation of customer behaviors.

Execution: K-Means was performed on the top 5 Principal Components using 25 random starts to ensure the stability of the centroids.
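The selection procedure described above can be sketched as follows. `pc_scores` is a hypothetical stand-in for the five retained principal components (four synthetic blobs), and the candidate range 2 to 8 is an assumption. A plot like Figure 7 is commonly produced with `fviz_nbclust(x, kmeans, method = "silhouette")` from factoextra, though the exact call used in the report is not shown.

```r
library(cluster)

set.seed(2024)

# Hypothetical stand-in for the 5 PCA coordinates: four well-separated blobs
centers <- rbind(c(0, 0, 0, 0, 0),
                 c(8, 0, 0, 0, 0),
                 c(0, 8, 0, 0, 0),
                 c(0, 0, 8, 0, 0))
pc_scores <- centers[rep(1:4, each = 100), ] + matrix(rnorm(400 * 5), ncol = 5)

# Average silhouette width for each candidate number of clusters
d <- dist(pc_scores)
k_grid <- 2:8
avg_sil <- sapply(k_grid, function(k) {
  km <- kmeans(pc_scores, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# Keep the k with the highest average silhouette and refit with 25 random starts
best_k <- k_grid[which.max(avg_sil)]
km_res <- kmeans(pc_scores, centers = best_k, nstart = 25)
```

The multiple random starts (`nstart = 25`) matter because a single K-Means run can settle in a poor local optimum; the best of the 25 fits is kept.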

7.2 Density-Based Outlier Detection (DBSCAN)

Financial datasets often contain Whales or irregular users whose extreme spending habits can distort traditional clusters. Unlike K-Means, DBSCAN identifies these as outliers.

Figure 8. K-NN Distance Plot

Using a K-NN distance plot, we identify the knee where the distance suddenly increases. A threshold of epsilon (eps) = 1.5 was chosen to separate core points from noise.

Objective: To label users that do not belong to any dense behavioral neighborhood as “Outliers,” effectively isolating anomalies for separate business review.
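The DBSCAN step can be sketched with the `dbscan` package loaded earlier. Only eps = 1.5 comes from the text; the `min_pts` value of 10 is an assumption, since the report does not state the minPts setting, and `pc_scores` is a hypothetical stand-in made of one dense group plus six isolated points.

```r
library(dbscan)

set.seed(99)

# Hypothetical stand-in: 300 densely packed points plus 6 isolated ones
dense <- matrix(rnorm(300 * 5, sd = 0.3), ncol = 5)
far   <- matrix(runif(6 * 5, min = 8, max = 20), ncol = 5)
pc_scores <- rbind(dense, far)

min_pts <- 10  # assumption: the report does not state the minPts it used

# K-NN distance plot: the knee of the sorted curve suggests a value for eps
kNNdistplot(pc_scores, k = min_pts - 1)
abline(h = 1.5, lty = 2)  # eps threshold quoted in the text

# Density-based clustering; points labelled 0 are treated as noise
db_res <- dbscan(pc_scores, eps = 1.5, minPts = min_pts)
n_noise <- sum(db_res$cluster == 0)
```

Here `db_res$cluster` follows the same convention as the plotting code in the next subsection: label 0 marks noise, and positive labels mark dense clusters.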

7.3 DBSCAN Density Map

To focus purely on the density of the customer base, we can visualize the dataset using only the DBSCAN results.

# Plotting DBSCAN results only to highlight density and noise
tsne_plot$dbscan_cluster <- as.factor(db_res$cluster)

ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = dbscan_cluster)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_d() +
  theme_minimal() +
  labs(title = "DBSCAN Density-Based Map",
       subtitle = "Cluster 0 represents detected behavioral noise/outliers",
       color = "DBSCAN Cluster")

Figure 9. DBSCAN Density-Based Map

This view ignores pre-defined cluster counts and shows only the natural dense populations. Cluster 0 (typically the darkest color) highlights the irregular users who are the most different from the rest of the customer base.

7.4 Cluster Validation on the t-SNE Manifold

The final step projects both the K-Means clusters (colors) and the DBSCAN outliers (shapes) onto the t-SNE manifold for visual verification.

# Add cluster labels to t-SNE coordinates for a map
tsne_plot$cluster_km <- as.factor(km_res$cluster)
tsne_plot$is_outlier <- as.factor(ifelse(db_res$cluster == 0, "Outlier", "Normal"))

# Plotting K-means results on a t-SNE manifold
ggplot(tsne_plot, aes(x = tsne1, y = tsne2, color = cluster_km, shape = is_outlier)) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  labs(title = "t-SNE Projection of Credit Card Segments",
       subtitle = "Colors = K-Means Clusters | Shapes = DBSCAN Outlier Detection",
       x = "t-SNE Dimension 1", y = "t-SNE Dimension 2")

Figure 10. t-SNE Projection of Credit Card Segments

Interpretation: The spatial separation of the 4 colors confirms that our segments occupy distinct islands in the behavioral space.

Noise Identification: The triangle shapes represent the outliers discovered by DBSCAN. Notice how these are often located at the edges of the manifold or in less dense areas, which suggests that they represent unusual behaviors that do not follow the main patterns.

8. Conclusion

The integration of PCA, K-Means, t-SNE, and DBSCAN has allowed us to condense 17 behavioral variables into four clearly defined customer personas. This segmentation provides a strategic roadmap for personalized marketing and risk management.

8.1 Behavioral Profiles

After identifying 4 distinct clusters and filtering noise with DBSCAN, we calculated the average behavioral metrics to define our final business personas.

Table 2: Final Behavioral Profiles per Segment
Average Financial Values ($; Purchases Freq. is a 0 to 1 score)

Cluster  Persona                    Count  Balance  Purchases  Cash Advance  Purchases Freq.  Credit Limit
1        The Inactive                3986     1003        279           565            0.178          3351
2        The VIP Shoppers             461     3408       7252           691            0.948          9784
3        The Active Transactors      3265      890       1210           210            0.889          4076
4        The Revolving Borrowers     1238     4465        463          4447            0.273          7309

Cluster    Persona Name     Primary Financial Behavior                                      Strategy Focus
Cluster 1  The Inactive     Low usage across all metrics; potentially “sleeping” accounts.  Re-engagement
Cluster 2  The VIPs         High credit limits and extreme spending volume.                 Premium Services
Cluster 3  The Transactors  High purchase frequency with low balances. They pay in full.    Loyalty & Rewards
Cluster 4  The Revolvers    High balance and heavy reliance on Cash Advances.               Debt Management
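Per-segment averages like those in Table 2 could be computed along the following lines. This is a sketch on hypothetical toy rows: the `cluster` column stands in for the K-Means labels attached to the raw data, and only three of the raw variables are summarized.

```r
library(dplyr)

# Hypothetical toy rows standing in for df_raw with K-Means labels attached
toy <- data.frame(
  cluster      = factor(c(1, 1, 2, 2, 2, 3)),
  BALANCE      = c(100, 300, 4000, 5000, 6000, 900),
  PURCHASES    = c(50, 150, 7000, 8000, 6000, 1200),
  CASH_ADVANCE = c(0, 200, 500, 700, 900, 100)
)

# Segment size and average of each metric per cluster
profiles <- toy %>%
  group_by(cluster) %>%
  summarise(
    Count = n(),
    across(c(BALANCE, PURCHASES, CASH_ADVANCE), mean),
    .groups = "drop"
  )

print(profiles)
```

Averages are taken on the raw, unscaled values so that the profile table stays in dollars and remains readable for business users.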

8.2 Technical Summary

By utilizing DBSCAN, we successfully filtered out anomalous noise (The Outliers), ensuring that the personas described above are stable and representative of the majority of the customer base. The use of t-SNE provided a visual validation that these segments are not just mathematical artifacts but real, distinct neighborhoods of financial behavior.