1. The Curiosity (Introduction)

Credit scores and limits often feel like a “black box.” How do banks actually categorize us? Do they just look at how much we spend, or how we spend it?

In this project, I analyze a dataset of 8,950 credit card holders (sourced from Kaggle). Instead of using supervised learning (where we know the answer), I use Unsupervised Learning to let the data speak for itself.

My goal is to perform a rigorous segmentation. I want to move beyond simple demographics and find behavioral tribes—like “The Big Spenders” or “The Cash-Advance Reliants.”

2. Data “Hygiene” (Preprocessing)

Real-world data is rarely clean. Before running any fancy algorithms, I need to handle missing values and remove identifiers that don’t predict behavior (like Customer ID).

Source of data: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata/data

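# 1. Load Packages and Data
library(dplyr)       # %>%, select(), mutate(), group_by(), summarise()
library(ggplot2)     # ggplot(), theme_minimal()
library(kableExtra)  # kbl(), kable_styling()
library(factoextra)  # fviz_eig(), fviz_gap_stat(), fviz_cluster()
library(cluster)     # clusGap()
library(uwot)        # umap(); assumed, because the UMAP result is indexed as a matrix later on
df <- read.csv("CC GENERAL.csv")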

# 2. Cleaning
# Removing CUST_ID (it's not a behavioral feature)
df_clean <- df %>% select(-CUST_ID)

# Handling Missing Values
# I chose to omit NA rows for robust clustering, though imputation is an alternative
df_clean <- na.omit(df_clean)

# 3. Scaling
# Since 'Balance' can be 10,000 and 'Frequency' is 0-1, scaling is mandatory.
df_scaled <- scale(df_clean)

# Preview
kbl(head(df_clean, 5), caption = "Snapshot of Raw Behavioral Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Snapshot of Raw Behavioral Data
	BALANCE	BALANCE_FREQUENCY	PURCHASES	ONEOFF_PURCHASES	INSTALLMENTS_PURCHASES	CASH_ADVANCE	PURCHASES_FREQUENCY	ONEOFF_PURCHASES_FREQUENCY	PURCHASES_INSTALLMENTS_FREQUENCY	CASH_ADVANCE_FREQUENCY	CASH_ADVANCE_TRX	PURCHASES_TRX	CREDIT_LIMIT	PAYMENTS	MINIMUM_PAYMENTS	PRC_FULL_PAYMENT	TENURE
1	40.90075	0.818182	95.40	0.00	95.40	0.000	0.166667	0.000000	0.083333	0.00	0	2	1000	201.8021	139.5098	0.000000	12
2	3202.46742	0.909091	0.00	0.00	0.00	6442.945	0.000000	0.000000	0.000000	0.25	4	0	7000	4103.0326	1072.3402	0.222222	12
3	2495.14886	1.000000	773.17	773.17	0.00	0.000	1.000000	1.000000	0.000000	0.00	0	12	7500	622.0667	627.2848	0.000000	12
5	817.71434	1.000000	16.00	16.00	0.00	0.000	0.083333	0.083333	0.000000	0.00	0	1	1200	678.3348	244.7912	0.000000	12
6	1809.82875	1.000000	1333.28	0.00	1333.28	0.000	0.666667	0.000000	0.583333	0.00	0	8	1800	1400.0578	2407.2460	0.000000	12
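The cleaning choices above (dropping rows with missing values, then standardizing) can be sketched on a tiny stand-in table. The values and the `impute_median` helper are illustrative assumptions, not part of the original analysis. The sketch also shows the median-imputation alternative mentioned in the code comment.

```r
# Toy stand-in for the credit card table (illustrative values only)
toy <- data.frame(
  BALANCE          = c(40.9, 3202.5, 2495.1, 1666.7, 817.7, 1809.8),
  CREDIT_LIMIT     = c(1000, 7000, 7500, 7500, 1200, NA),
  MINIMUM_PAYMENTS = c(139.5, 1072.3, 627.3, NA, 244.8, 2407.2)
)

# How many values are missing in each column?
na_counts <- colSums(is.na(toy))
print(na_counts)

# Option A (used in this report): drop every row that has any NA
toy_dropped <- na.omit(toy)

# Option B (alternative): replace each NA with its column median
# impute_median is a hypothetical helper written for this sketch
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

# Standardize: each column ends up with mean 0 and standard deviation 1
toy_scaled <- scale(toy_imputed)
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

Dropping rows is simpler but discards customers. Imputation keeps them at the cost of inventing values, so which option is better depends on how many rows are affected.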

3. Seeing the Unseeable (Dimensionality Reduction)

We have 17 different variables. Humans can’t visualize 17 dimensions. To understand the structure of the data, I need to compress these into 2 dimensions.

I decided to compare a Linear method (PCA) against a Non-Linear method (UMAP).

A. The Linear View: PCA

PCA rotates the data to maximize variance. It’s great for seeing the “big picture.”

# Run PCA
# df_scaled is already standardized, so center/scale. are redundant here but harmless
pca_res <- prcomp(df_scaled, center = TRUE, scale. = TRUE)

# Visualizing Variance Explained
fviz_eig(pca_res, addlabels = TRUE, ylim = c(0, 50), 
         barfill = "#E7B800", barcolor = "#E7B800",
         main = "PCA: How much info do we keep?")

My Take: The first two components only explain about 48% of the variance. This suggests the data is complex and might have non-linear relationships that PCA misses.
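The share of variance quoted above comes straight from the component standard deviations that `prcomp()` returns. Here is a minimal sketch, using the built-in `USArrests` data purely as a stand-in for `df_scaled`. The 80% threshold is an arbitrary assumption for illustration.

```r
# Built-in data set used purely as a stand-in for df_scaled
X <- scale(USArrests)

pca <- prcomp(X)  # X is already standardized, so no extra scaling is requested

# Each component's variance is its squared standard deviation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

print(round(data.frame(PC = seq_along(var_explained),
                       var_explained, cum_explained), 3))

# Smallest number of components that retains at least 80% of the variance
# (the 80% cutoff is an illustrative choice, not a rule)
n_keep <- which(cum_explained >= 0.80)[1]
print(n_keep)
```

Applied to `pca_res`, the second entry of the cumulative vector is the two-component figure reported in the scree plot.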

B. The Non-Linear View: UMAP

Since PCA wasn’t perfect, I tried UMAP (Uniform Manifold Approximation and Projection). This is a modern technique often used in genomics, but why not for credit cards? It tends to preserve local neighborhoods better than PCA does.

set.seed(123) # For reproducibility
# umap() is assumed to come from the uwot package, which returns a matrix of coordinates
umap_res <- umap(df_scaled)
umap_df <- data.frame(x = umap_res[,1], y = umap_res[,2])

ggplot(umap_df, aes(x, y)) +
  geom_point(alpha = 0.5, color = "#2C3E50") +
  theme_minimal() +
  labs(title = "UMAP Projection", subtitle = "Notice the distinct 'islands' of customers")

Observation: UMAP reveals a much clearer structure than PCA. We can already see dense clusters (the “islands”) forming, which gives me confidence that K-Means will work well.

4. The “Goldilocks” Number (Gap Statistic)

How many clusters ($k$) should we choose? 3? 5? 10? Instead of guessing or using the subjective “Elbow Method,” I used the Gap Statistic. It compares the total intra-cluster variation for different values of $k$ with its expected value under a null reference distribution of the data (that is, data with no obvious clustering).
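For reference, the comparison described above can be written out following the original definition by Tibshirani, Walther and Hastie. The symbols are not from this report: $C_r$ is cluster $r$, $n_r$ its size, $d_{ii'}$ a pairwise distance between observations (squared Euclidean in the original paper), $B$ the number of simulated reference datasets, and $s_k$ the simulation standard error.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'}

\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k
  \approx \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^{*} - \log W_k

\hat{k} = \min \left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}
```

A large gap means the observed clusters are much tighter than they would be in structureless data. The last line is the selection rule proposed in the original paper. Software may apply a different default rule.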

(Note: This calculation is computationally heavy, so for this report I ran it on the first 1,000 rows only.)

set.seed(123)
# Using a subset for Gap Statistic speed (optional, remove subsetting for full rigorous run)
gap_stat <- clusGap(df_scaled[1:1000, ], FUN = kmeans, nstart = 25, K.max = 8, B = 50)

fviz_gap_stat(gap_stat) + 
  theme_minimal() + 
  labs(title = "Gap Statistic: Determining Optimal k")

Decision: The Gap Statistic suggests that k = 3 or k = 4 are optimal points. For business interpretability (marketing teams can’t handle too many segments), I will proceed with k = 4.
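The candidate values of $k$ can also be read off programmatically rather than from the plot. A minimal sketch on synthetic data, assuming the `cluster` package: `maxSE()` applies a named selection rule to the gap values and their standard errors. The three-blob data set and the smaller `B` are illustrative choices to keep the run fast.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)
# Three well-separated 2-D blobs, 50 points each (synthetic data)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

gap <- clusGap(toy, FUN = kmeans, nstart = 10, K.max = 6, B = 20)

# Tab holds logW, E.logW, gap and SE.sim for k = 1..K.max
print(round(gap$Tab, 3))

# Two selection rules applied to the same gap curve
k_tibs  <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
k_first <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(c(Tibs2001SEmax = k_tibs, firstSEmax = k_first))
```

When the rules disagree, as they can on real data, a business criterion such as interpretability is a reasonable tie-breaker.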

5. Defining the Tribes (K-Means Clustering)

Now we apply K-Means with our chosen $k=4$.

set.seed(123)
# Run K-Means
final_kmeans <- kmeans(df_scaled, centers = 4, nstart = 25)

# Add cluster labels to original data
df_clustered <- df_clean %>%
  mutate(Cluster = as.factor(final_kmeans$cluster))

# Visualize Clusters on the PCA dimensions
fviz_cluster(final_kmeans, data = df_scaled,
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_minimal(),
             main = "Customer Segments Visualized (PCA Reduced)")
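Because K-Means ran on standardized data, its cluster centers are in z-score units. A minimal sketch of converting them back to original units, using the attributes that `scale()` stores on its result. The two-feature data set is synthetic and only stands in for `df_clean` and `df_scaled`.

```r
set.seed(123)
# Synthetic "customers": two features on very different scales
raw <- data.frame(
  BALANCE   = c(rnorm(50, 500, 100), rnorm(50, 4000, 500)),
  FREQUENCY = c(runif(50, 0.0, 0.3), runif(50, 0.7, 1.0))
)
scaled <- scale(raw)

km <- kmeans(scaled, centers = 2, nstart = 25)

# Undo the standardization: multiply by the stored sd, then add the stored mean
centers_raw <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_raw <- sweep(centers_raw, 2, attr(scaled, "scaled:center"), "+")
print(round(centers_raw, 2))

# Cluster sizes, so each segment can be weighed by how many customers it covers
print(km$size)
```

The back-transformed centers are per-cluster means in original units. They complement a table of per-cluster medians and will differ from it when a feature is skewed.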

6. Who are these people? (Interpretation)

The math is done. Now comes the business logic. I summarized the median behavior for each cluster to identify their “persona.”

# Summarize distinct behaviors (every column is a per-cluster median, including the "Avg_" ones)
summary_table <- df_clustered %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Balance = median(BALANCE),
    Avg_Purchases = median(PURCHASES),
    Cash_Advance = median(CASH_ADVANCE),
    Credit_Limit = median(CREDIT_LIMIT),
    Pay_Freq = median(PRC_FULL_PAYMENT)
  )

kbl(summary_table, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Cluster	Avg_Balance	Avg_Purchases	Cash_Advance	Credit_Limit	Pay_Freq
1	2504.31	5969.33	0.00	9000	0.00
2	834.92	82.00	132.98	2500	0.00
3	354.21	927.58	0.00	3000	0.08
4	4345.88	91.84	3715.64	7000	0.00

My Interpretation of the 4 Tribes:

Cluster 1: “The Big Spenders (VIPs)”
- Traits: Very high Purchases (median about 5,969), highest Credit_Limit (9,000).
- Insight: These customers put a lot of spending on the card. Their median PRC_FULL_PAYMENT is 0, though, so at least half of them never paid the balance in full.
- Strategy: Loyalty programs, Premium upgrades.
Cluster 2: “The Minimalists”
- Traits: Low Balance, Low Purchases, Low Cash_Advance, lowest Credit_Limit (2,500).
- Insight: They have the card but rarely use it.
- Strategy: Activation campaigns (e.g., “Spend $50 get $10 back”).
Cluster 3: “The Everyday Users”
- Traits: Lowest Balance (median about 354), moderate Purchases (about 928), no Cash_Advance.
- Insight: They use the card for purchases and keep balances low. This is the only cluster with a nonzero median PRC_FULL_PAYMENT (0.08).
Cluster 4: “The Cash-Dependent”
- Traits: High Cash_Advance (median about 3,716), Low Purchases, highest Balance (about 4,346), median PRC_FULL_PAYMENT of 0.
- Insight: These users use the card like a loan and carry debt month-to-month. They might be financially struggling or using the card for emergencies. The interest is profitable for the bank but risky.
- Strategy: Risk monitoring.
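K-Means assigns cluster numbers arbitrarily, so the same segments can come back under different numbers after a rerun or a seed change. One hedged way to keep persona descriptions aligned with the summary table is to renumber clusters by a chosen feature. The sketch below uses synthetic balances and orders clusters by median balance. The ordering feature is an assumption for illustration.

```r
set.seed(123)
# Synthetic stand-in: three groups with clearly different balances
balance <- c(rnorm(40, 500, 50), rnorm(40, 2500, 100), rnorm(40, 5000, 200))
km <- kmeans(scale(balance), centers = 3, nstart = 25)

# Median balance per raw K-Means label (the label order is arbitrary)
med_by_label <- sapply(split(balance, km$cluster), median)

# Rank the labels by median balance: 1 = lowest, 3 = highest
new_id <- rank(med_by_label, ties.method = "first")
stable_cluster <- factor(new_id[as.character(km$cluster)])

# Medians now increase with the cluster number, whatever labels kmeans() chose
print(sapply(split(balance, stable_cluster), median))
```

With stable numbering, "Cluster 1" always refers to the lowest-balance group, so the written interpretation does not drift away from the table.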

7. Conclusion

By combining PCA (for overview) and UMAP (for structure), I confirmed that credit card users naturally fall into distinct behavioral groups.

Using the Gap Statistic, I narrowed the choice to 3 or 4 clusters and settled on 4 for business interpretability. This moves credit card segmentation from “gut feeling” toward data-driven science. For a bank, this means stopping the “one size fits all” marketing and starting to treat “The VIPs” differently from “The Cash Users.”

Who Holds the Cards? Unveiling Credit Behaviors

A Comparative Analysis using PCA, UMAP, and K-Means Clustering

Xinyan Su

2026-01-31