Credit scores and limits often feel like a “black box.” How do banks actually categorize us? Do they just look at how much we spend, or how we spend it?
In this project, I analyze a dataset of 8,950 credit card holders (sourced from Kaggle). Instead of using supervised learning (where we know the answer), I use Unsupervised Learning to let the data speak for itself.
My goal is to perform a rigorous segmentation. I want to move beyond simple demographics and find behavioral tribes—like “The Big Spenders” or “The Cash-Advance Reliants.”
Real-world data is rarely clean. Before running any fancy algorithms, I need to handle missing values and remove identifiers that don’t predict behavior (like Customer ID).
Source of data: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata/data
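# 1. Load Data
# Packages used throughout this report
library(dplyr)       # %>%, select(), mutate(), group_by(), summarise()
library(ggplot2)     # ggplot()
library(kableExtra)  # kbl(), kable_styling()
library(factoextra)  # fviz_eig(), fviz_gap_stat(), fviz_cluster()
library(cluster)     # clusGap()
df <- read.csv("CC GENERAL.csv")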
# 2. Cleaning
# Removing CUST_ID (it's not a behavioral feature)
df_clean <- df %>% select(-CUST_ID)
# Handling Missing Values
# I chose to omit NA rows for robust clustering, though imputation is an alternative
df_clean <- na.omit(df_clean)
# 3. Scaling
# Since 'Balance' can be 10,000 and 'Frequency' is 0-1, scaling is mandatory.
df_scaled <- scale(df_clean)
# Preview
kbl(head(df_clean, 5), caption = "Snapshot of Raw Behavioral Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 40.90075 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000 | 0.166667 | 0.000000 | 0.083333 | 0.00 | 0 | 2 | 1000 | 201.8021 | 139.5098 | 0.000000 | 12 |
| 2 | 3202.46742 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945 | 0.000000 | 0.000000 | 0.000000 | 0.25 | 4 | 0 | 7000 | 4103.0326 | 1072.3402 | 0.222222 | 12 |
| 3 | 2495.14886 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000 | 1.000000 | 1.000000 | 0.000000 | 0.00 | 0 | 12 | 7500 | 622.0667 | 627.2848 | 0.000000 | 12 |
| 5 | 817.71434 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000 | 0.083333 | 0.083333 | 0.000000 | 0.00 | 0 | 1 | 1200 | 678.3348 | 244.7912 | 0.000000 | 12 |
| 6 | 1809.82875 | 1.000000 | 1333.28 | 0.00 | 1333.28 | 0.000 | 0.666667 | 0.000000 | 0.583333 | 0.00 | 0 | 8 | 1800 | 1400.0578 | 2407.2460 | 0.000000 | 12 |
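The cleaning code above drops every row that contains an NA. The imputation alternative mentioned in its comment could look roughly like the sketch below. The toy values and the helper name `impute_median` are made up for illustration, and median imputation is only one of several options, not the method used in this report.

```r
# Toy data standing in for the credit card table (values are invented)
toy <- data.frame(
  CREDIT_LIMIT     = c(1000, 7000, NA, 1200, 1800),
  MINIMUM_PAYMENTS = c(139.5, NA, 627.3, 244.8, NA)
)

# Hypothetical helper: replace each NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper column by column
toy_imputed <- as.data.frame(lapply(toy, impute_median))

nrow(toy_imputed)   # all rows are kept, unlike na.omit()
anyNA(toy_imputed)  # no missing values remain
```

The trade-off is that imputation keeps every customer in the clustering but pulls the filled-in values toward the middle of each column, whereas na.omit() keeps only fully observed rows.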
We have 17 different variables. Humans can’t visualize 17 dimensions. To understand the structure of the data, I need to compress these into 2 dimensions.
I decided to compare a Linear method (PCA) against a Non-Linear method (UMAP).
PCA rotates the data to maximize variance. It’s great for seeing the “big picture.”
# Run PCA
pca_res <- prcomp(df_scaled, center = TRUE, scale. = TRUE)
# Visualizing Variance Explained
fviz_eig(pca_res, addlabels = TRUE, ylim = c(0, 50),
barfill = "#E7B800", barcolor = "#E7B800",
main = "PCA: How much info do we keep?")
My Take: The first two components only explain about 48% of the variance. This suggests the data is complex and might have non-linear relationships that PCA misses.
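The percentages in a scree plot can also be computed by hand from the prcomp() result, which makes it easy to check how many components are needed for a given share of the variance. The sketch below uses simulated data in place of df_scaled, and the 80% threshold is an arbitrary choice for illustration.

```r
set.seed(123)
# Simulated stand-in for df_scaled: 200 rows, two correlated columns plus noise
base <- rnorm(200)
toy <- cbind(a = base + rnorm(200, sd = 0.3),
             b = base + rnorm(200, sd = 0.3),
             c = rnorm(200),
             d = rnorm(200),
             e = rnorm(200))

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_explained <- cumsum(var_explained)

round(rbind(var_explained, cum_explained), 3)

# Smallest number of components whose cumulative share reaches 80% (assumed cutoff)
n_needed <- which(cum_explained >= 0.8)[1]
n_needed
```

Running the same lines on pca_res would give the cumulative share behind the "first two components" figure quoted above.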
Since PCA wasn’t perfect, I tried UMAP (Uniform Manifold Approximation and Projection). This is a modern technique often used in genomics, but why not for credit cards? It preserves local neighborhoods better.
set.seed(123) # For reproducibility
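# Assumes umap() comes from the uwot package, which returns a matrix of coordinates.
# With the umap package instead, the coordinates are in umap_res$layout.
umap_res <- umap(df_scaled)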
umap_df <- data.frame(x = umap_res[,1], y = umap_res[,2])
ggplot(umap_df, aes(x, y)) +
geom_point(alpha = 0.5, color = "#2C3E50") +
theme_minimal() +
labs(title = "UMAP Projection", subtitle = "Notice the distinct 'islands' of customers")
Observation: UMAP reveals a much clearer structure than PCA. We can already see dense clusters (the “islands”) forming, which gives me confidence that K-Means will work well.
How many clusters (\(k\)) should we choose? 3? 5? 10? Instead of guessing or using the subjective “Elbow Method,” I used the Gap Statistic. It compares the total intra-cluster variation for different values of \(k\) with its expected value under a null reference distribution of the data.
(Note: This calculation is computationally heavy, so I limited nstart for this report.)
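For reference, the comparison described above can be written out using the definition from Tibshirani, Walther and Hastie (2001), who introduced the Gap Statistic. The symbols \(W_k\), \(D_r\), \(n_r\) and \(s_k\) are notation brought in for this note and are not used elsewhere in this report.

\[
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,
\qquad
\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k
\]

Here \(D_r\) is the sum of pairwise distances between all points in cluster \(r\), \(n_r\) is the size of that cluster, and \(E_n^{*}\) is the expectation under the null reference distribution, estimated by clustering simulated reference datasets. The rule proposed in that paper picks the smallest \(k\) such that \(\mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1}\), where \(s_{k+1}\) is the simulation standard error at \(k+1\).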
set.seed(123)
# Using a subset for Gap Statistic speed (optional, remove subsetting for full rigorous run)
gap_stat <- clusGap(df_scaled[1:1000, ], FUN = kmeans, nstart = 25, K.max = 8, B = 50)
fviz_gap_stat(gap_stat) +
theme_minimal() +
labs(title = "Gap Statistic: Determining Optimal k")
Decision: The Gap Statistic suggests that k = 3 or k = 4 are optimal points. For business interpretability (marketing teams can’t handle too many segments), I will proceed with k = 4.
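The run above takes the first 1,000 rows as its subset. If the file happens to be ordered in some way, a random subsample may be more representative. Below is a minimal sketch of that variation on simulated data; the sample size, K.max and B are kept small purely for speed and are not recommendations.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)
# Simulated stand-in for df_scaled: three well-separated groups in 2-D
toy <- rbind(
  matrix(rnorm(400, mean = 0), ncol = 2),
  matrix(rnorm(400, mean = 6), ncol = 2),
  cbind(rnorm(200, mean = 0), rnorm(200, mean = 6))
)
toy_scaled <- scale(toy)

# Draw rows at random (without replacement) rather than taking the first n rows
idx <- sample(nrow(toy_scaled), size = 200)

gap_toy <- clusGap(toy_scaled[idx, ], FUNcluster = kmeans, nstart = 10,
                   K.max = 5, B = 20, verbose = FALSE)

# Tab holds logW, E.logW, gap and SE.sim for k = 1..K.max
round(gap_toy$Tab, 3)

# One possible selection rule; other methods are available in maxSE()
k_hat <- maxSE(gap_toy$Tab[, "gap"], gap_toy$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k_hat
```

Repeating the run with a few different seeds is a cheap way to see whether the suggested k is stable or an artifact of one particular subsample.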
Now we apply K-Means with our chosen \(k=4\).
set.seed(123)
# Run K-Means
final_kmeans <- kmeans(df_scaled, centers = 4, nstart = 25)
# Add cluster labels to original data
df_clustered <- df_clean %>%
mutate(Cluster = as.factor(final_kmeans$cluster))
# Visualize Clusters on the PCA dimensions
fviz_cluster(final_kmeans, data = df_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal(),
main = "Customer Segments Visualized (PCA Reduced)")
The math is done. Now comes the business logic. I summarized the median behavior for each cluster to identify their “persona.”
# Summarize distinct behaviors
# Note: every column below is a per-cluster median, despite the Avg_ prefix
summary_table <- df_clustered %>%
group_by(Cluster) %>%
summarise(
Avg_Balance = median(BALANCE),
Avg_Purchases = median(PURCHASES),
Cash_Advance = median(CASH_ADVANCE),
Credit_Limit = median(CREDIT_LIMIT),
Pay_Freq = median(PRC_FULL_PAYMENT)
)
kbl(summary_table, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| Cluster | Avg_Balance | Avg_Purchases | Cash_Advance | Credit_Limit | Pay_Freq |
|---|---|---|---|---|---|
| 1 | 2504.31 | 5969.33 | 0.00 | 9000 | 0.00 |
| 2 | 834.92 | 82.00 | 132.98 | 2500 | 0.00 |
| 3 | 354.21 | 927.58 | 0.00 | 3000 | 0.08 |
| 4 | 4345.88 | 91.84 | 3715.64 | 7000 | 0.00 |
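Per-cluster medians are one way to read the personas. Another is to take the K-Means centers, which live in scaled units, and convert them back to the original units. The sketch below shows that back-transformation on invented two-column data. Note that centers are cluster means, not medians, so on skewed data they will not match a median table.

```r
set.seed(123)
# Toy data: two invented behavioral columns on very different scales
toy <- data.frame(
  BALANCE             = c(rnorm(50, 500, 100), rnorm(50, 4000, 300)),
  PURCHASES_FREQUENCY = c(runif(50, 0.6, 1), runif(50, 0, 0.3))
)

toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 2, nstart = 10)

# scale() stores the column means and standard deviations as attributes
mu    <- attr(toy_scaled, "scaled:center")
sigma <- attr(toy_scaled, "scaled:scale")

# Undo the z-score column by column: original = scaled * sd + mean
centers_orig <- sweep(km$centers, 2, sigma, `*`)
centers_orig <- sweep(centers_orig, 2, mu, `+`)

round(centers_orig, 2)
```

Because the scaling is linear, the back-transformed centers equal the per-cluster means of the unscaled columns, which gives a quick sanity check on the result.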
Cluster 4: high Cash_Advance, low Purchases.
Cluster 1: high Purchases, high Credit_Limit.
By combining PCA (for overview) and UMAP (for structure), I confirmed that credit card users naturally fall into distinct behavioral groups.
Using the Gap Statistic, I narrowed the choice to 3 or 4 clusters and settled on 4 for interpretability. This moves credit card segmentation from “gut feeling” to data-driven science. For a bank, this means stopping the “one size fits all” marketing and starting to treat “The VIPs” differently from “The Cash Users.”