1. The Curiosity (Introduction)

Credit scores and limits often feel like a “black box.” How do banks actually categorize us? Do they just look at how much we spend, or how we spend it?

In this project, I analyze a dataset of 8,950 credit card holders (sourced from Kaggle). Instead of using supervised learning (where we know the answer), I use Unsupervised Learning to let the data speak for itself.

My goal is to perform a rigorous segmentation. I want to move beyond simple demographics and find behavioral tribes—like “The Big Spenders” or “The Cash-Advance Reliants.”


2. Data “Hygiene” (Preprocessing)

Real-world data is rarely clean. Before running any fancy algorithms, I need to handle missing values and remove identifiers that don’t predict behavior (like Customer ID).

Source of data: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata/data

# 0. Libraries used throughout this report
library(dplyr)       # data wrangling (%>%, mutate, group_by)
library(ggplot2)     # plotting
library(kableExtra)  # kbl() / kable_styling() tables
library(factoextra)  # fviz_eig(), fviz_gap_stat(), fviz_cluster()
library(cluster)     # clusGap(), maxSE()
library(uwot)        # umap() (the 'umap' package also works, but returns a list)

# 1. Load Data
df <- read.csv("CC GENERAL.csv")

# 2. Cleaning
# Removing CUST_ID (it's not a behavioral feature)
df_clean <- df %>% select(-CUST_ID)

# Handling Missing Values
# I chose to omit NA rows for robust clustering, though imputation is an alternative
df_clean <- na.omit(df_clean)

# 3. Scaling
# Since 'Balance' can be 10,000 and 'Frequency' is 0-1, scaling is mandatory.
df_scaled <- scale(df_clean)

# Preview
kbl(head(df_clean, 5), caption = "Snapshot of Raw Behavioral Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Snapshot of Raw Behavioral Data
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
1 40.90075 0.818182 95.40 0.00 95.40 0.000 0.166667 0.000000 0.083333 0.00 0 2 1000 201.8021 139.5098 0.000000 12
2 3202.46742 0.909091 0.00 0.00 0.00 6442.945 0.000000 0.000000 0.000000 0.25 4 0 7000 4103.0326 1072.3402 0.222222 12
3 2495.14886 1.000000 773.17 773.17 0.00 0.000 1.000000 1.000000 0.000000 0.00 0 12 7500 622.0667 627.2848 0.000000 12
5 817.71434 1.000000 16.00 16.00 0.00 0.000 0.083333 0.083333 0.000000 0.00 0 1 1200 678.3348 244.7912 0.000000 12
6 1809.82875 1.000000 1333.28 0.00 1333.28 0.000 0.666667 0.000000 0.583333 0.00 0 8 1800 1400.0578 2407.2460 0.000000 12
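Dropping rows is a real modelling choice, so it is worth knowing how much data na.omit() actually removed. A quick check, using the df and df_clean objects created above:

# Where are the missing values, and how many customers did we lose?
colSums(is.na(df))            # NA count per column
nrow(df) - nrow(df_clean)     # number of rows dropped by na.omit()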

3. Seeing the Unseeable (Dimensionality Reduction)

We have 17 different variables. Humans can’t visualize 17 dimensions. To understand the structure of the data, I need to compress these into 2 dimensions.

I decided to compare a Linear method (PCA) against a Non-Linear method (UMAP).

A. The Linear View: PCA

PCA rotates the data to maximize variance. It’s great for seeing the “big picture.”

# Run PCA (df_scaled is already standardized, so center/scale. are effectively no-ops here)
pca_res <- prcomp(df_scaled, center = TRUE, scale. = TRUE)

# Visualizing Variance Explained
fviz_eig(pca_res, addlabels = TRUE, ylim = c(0, 50), 
         barfill = "#E7B800", barcolor = "#E7B800",
         main = "PCA: How much info do we keep?")

My Take: The first two components only explain about 48% of the variance. This suggests the data is complex and might have non-linear relationships that PCA misses.
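To put a number on that “48%,” the variance explained can be read straight off the prcomp object. A quick check, using pca_res from above:

# Cumulative proportion of variance explained by the first few components
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)
round(cumsum(var_explained)[1:5], 3)   # PC1 through PC5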

B. The Non-Linear View: UMAP

Since PCA wasn’t perfect, I tried UMAP (Uniform Manifold Approximation and Projection). It’s a modern technique most often seen in genomics, but there’s no reason it can’t work on credit card behavior: it preserves local neighborhood structure much better than a linear projection like PCA.

set.seed(123) # For reproducibility
# Assuming uwot::umap(), which returns the 2-D coordinates as a matrix;
# with the 'umap' package, use umap_res$layout instead of umap_res below.
umap_res <- umap(df_scaled)
umap_df <- data.frame(x = umap_res[, 1], y = umap_res[, 2])

ggplot(umap_df, aes(x, y)) +
  geom_point(alpha = 0.5, color = "#2C3E50") +
  theme_minimal() +
  labs(title = "UMAP Projection", subtitle = "Notice the distinct 'islands' of customers")

Observation: UMAP reveals a much clearer structure than PCA. We can already see dense clusters (the “islands”) forming, which gives me confidence that K-Means will work well.
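The shape of a UMAP plot depends heavily on its two main knobs: n_neighbors (local vs. global structure) and min_dist (how tightly points are packed). A minimal sketch of how one could probe them, assuming the uwot implementation of umap() as above; the specific values are illustrative only:

# Exploring UMAP hyperparameters (sketch; uwot::umap assumed)
set.seed(123)
umap_tight <- umap(df_scaled, n_neighbors = 10, min_dist = 0.05)  # emphasizes local detail
umap_broad <- umap(df_scaled, n_neighbors = 50, min_dist = 0.30)  # emphasizes global layout

par(mfrow = c(1, 2))
plot(umap_tight, pch = 16, cex = 0.3, main = "n_neighbors = 10")
plot(umap_broad, pch = 16, cex = 0.3, main = "n_neighbors = 50")

If the same islands survive across settings, the structure is likely real rather than an artifact of one parameter choice.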


4. The “Goldilocks” Number (Gap Statistic)

How many clusters (\(k\)) should we choose? 3? 5? 10? Instead of guessing or relying on the subjective “Elbow Method,” I used the Gap Statistic. It compares the total intra-cluster variation for different values of \(k\) with its expected value under a null reference distribution of the data.
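Formally, if \(W_k\) is the pooled within-cluster sum of squares for a given \(k\), the statistic is

\[ \mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k, \]

where \(E_n^{*}\) is the expectation under the reference (null) distribution, estimated by Monte Carlo (the B = 50 reference samples in the code below). The usual selection rule is the smallest \(k\) whose gap is within one standard error of the gap at \(k + 1\).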

(Note: This calculation is computationally heavy, so I ran it on a 1,000-customer subsample rather than the full dataset for this report.)

set.seed(123)
# Using the first 1,000 customers to keep the run time manageable
# (for a fully rigorous run, use a random subsample or the complete data)
gap_stat <- clusGap(df_scaled[1:1000, ], FUN = kmeans, nstart = 25, K.max = 8, B = 50)

fviz_gap_stat(gap_stat) + 
  theme_minimal() + 
  labs(title = "Gap Statistic: Determining Optimal k")

Decision: The Gap Statistic flags k = 3 and k = 4 as the strongest candidates. For business interpretability (marketing teams can’t work with too many segments), I will proceed with k = 4.
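For transparency, the same one-standard-error decision rule can be applied programmatically with cluster::maxSE(). A sketch, assuming the gap_stat object computed above; it may or may not agree exactly with the visual read:

# Let the formal selection rule pick k from the Gap Statistic table
tab <- as.data.frame(gap_stat$Tab)
maxSE(tab$gap, tab$SE.sim, method = "Tibs2001SEmax")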


5. Defining the Tribes (K-Means Clustering)

Now we apply K-Means with our chosen \(k=4\).

set.seed(123)
# Run K-Means
final_kmeans <- kmeans(df_scaled, centers = 4, nstart = 25)

# Add cluster labels to original data
df_clustered <- df_clean %>%
  mutate(Cluster = as.factor(final_kmeans$cluster))

# Visualize Clusters on the PCA dimensions
fviz_cluster(final_kmeans, data = df_scaled,
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_minimal(),
             main = "Customer Segments Visualized (PCA Reduced)")


6. Who are these people? (Interpretation)

The math is done. Now comes the business logic. I summarized the median behavior for each cluster to identify their “persona.”

# Summarize distinct behaviors (medians, so a few extreme spenders don't dominate)
summary_table <- df_clustered %>%
  group_by(Cluster) %>%
  summarise(
    Med_Balance      = median(BALANCE),
    Med_Purchases    = median(PURCHASES),
    Med_Cash_Advance = median(CASH_ADVANCE),
    Med_Credit_Limit = median(CREDIT_LIMIT),
    Med_Full_Payment = median(PRC_FULL_PAYMENT)  # proportion of full payments
  )

kbl(summary_table, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Cluster Med_Balance Med_Purchases Med_Cash_Advance Med_Credit_Limit Med_Full_Payment
1 2504.31 5969.33 0.00 9000 0.00
2 834.92 82.00 132.98 2500 0.00
3 354.21 927.58 0.00 3000 0.08
4 4345.88 91.84 3715.64 7000 0.00
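The medians above are in raw dollars and frequencies, which makes cross-feature comparison awkward. An alternative view is to profile each cluster by its standardized feature means; a sketch, reusing the objects created above (assumes tidyr is also installed):

# Heatmap of standardized feature means per cluster
library(tidyr)

profile_long <- as.data.frame(df_scaled) %>%
  mutate(Cluster = as.factor(final_kmeans$cluster)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(-Cluster, names_to = "Feature", values_to = "Z_Mean")

ggplot(profile_long, aes(Feature, Cluster, fill = Z_Mean)) +
  geom_tile() +
  scale_fill_gradient2(low = "#2C3E50", mid = "white", high = "#E7B800") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Segment Profiles", subtitle = "Mean z-score of each feature per cluster")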

My Interpretation of the 4 Tribes (reading the summary table above):

  1. Cluster 1: “The Big Spenders (VIPs)”
    • Traits: Very high Purchases (median ≈ $6,000), the highest Credit_Limit, no cash advances.
    • Insight: These are the ideal customers. They run their spending through the card.
    • Strategy: Loyalty programs, Premium upgrades.
  2. Cluster 2: “The Revolvers”
    • Traits: Low Purchases, low Credit_Limit, a carried Balance, and essentially no full payments.
    • Insight: They carry debt month-to-month and pay interest. This is profitable for the bank but risky.
  3. Cluster 3: “The Minimalists”
    • Traits: Lowest Balance, modest Purchases, no Cash_Advance use.
    • Insight: They have the card but make light use of it.
    • Strategy: Activation campaigns (e.g., “Spend $50 get $10 back”).
  4. Cluster 4: “The Cash-Dependent”
    • Traits: High Cash_Advance, high Balance, very low Purchases.
    • Insight: These users use the card like a loan. They might be financially struggling or using the card for emergencies.
    • Strategy: Risk monitoring.
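To make the segmentation usable downstream (dashboards, campaign lists), the numeric labels can be mapped to persona names. A sketch, assuming the cluster-to-persona mapping described above holds for this run; the names themselves are just the labels chosen in this report:

# Attach human-readable persona labels to each customer
df_personas <- df_clustered %>%
  mutate(Persona = recode(Cluster,
                          "1" = "Big Spender (VIP)",
                          "2" = "Revolver",
                          "3" = "Minimalist",
                          "4" = "Cash-Dependent"))

table(df_personas$Persona)   # customers per persona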

7. Conclusion

By combining PCA (for overview) and UMAP (for structure), I confirmed that credit card users naturally fall into distinct behavioral groups.

Using the Gap Statistic, we gave the choice of 4 clusters a formal statistical footing rather than a guess. This moves credit card segmentation from “gut feeling” to data-driven science. For a bank, this means stopping the “one size fits all” marketing and starting to treat “The Big Spenders (VIPs)” differently from “The Cash-Dependent.”