Uncovering Hidden Subgroups in Single-Class Data Using Unsupervised Learning

Author

Saurabh C Srivastava

Published

February 21, 2025

Objective of the Analysis

The goal of this analysis is to identify hidden subgroups within a single-class dataset using an unsupervised learning approach. Since the dataset carries only one class label, we combine Random Forest-based proximity analysis with Partitioning Around Medoids (PAM) clustering to recover a two-group structure. By leveraging the proximity matrix produced by the Random Forest, we aim to uncover natural patterns and potential subgroups in the data, enabling better segmentation and classification in future applications.

Practical Applications of Identifying Hidden Subgroups in Single-Class Data

The approach of using unsupervised clustering on single-class data can be applied across various domains where the goal is to uncover hidden patterns, anomalies, or subgroups within seemingly uniform datasets. This method is particularly useful when explicit labels for classification are unavailable but meaningful differences exist within the data.

In credit card fraud detection, banks often maintain records of legitimate transactions without explicitly labeled fraudulent cases, even though some transactions may in fact be fraudulent and simply remain undetected. By leveraging Random Forest proximity analysis and clustering, transactions can be grouped into distinct clusters, highlighting those that deviate from normal spending patterns. By analyzing the importance of transaction features such as amount, merchant type, location, and time of day, financial institutions can flag suspicious transactions for manual review, improving fraud detection without needing pre-labeled fraud cases.

Similarly, in recruitment and candidate screening, companies hire employees based on multiple factors, but over time, some candidates excel while others underperform. Without explicit labels distinguishing successful vs. unsuccessful hires, clustering can help segment employees into strong and weak performers based on skills, education, experience, and job performance metrics. This insight allows organizations to refine their hiring process, prioritize key attributes contributing to success, and improve overall workforce quality.

In customer segmentation, businesses often treat all customers the same, but some customers contribute significantly more to revenue than others. By applying unsupervised clustering, customers can be categorized into high-value and low-value segments, helping companies understand spending behaviors, product preferences, and purchasing frequency. Identifying these differences enables businesses to develop personalized marketing campaigns, offer loyalty rewards to engaged customers, and re-engage inactive customers, ultimately leading to higher sales and improved customer retention.

The healthcare industry also benefits from this technique, particularly in early disease detection and preventive care. Hospitals collect vast amounts of patient data, but in many cases, there are no pre-diagnosed labels for conditions such as diabetes or hypertension. By clustering patient records, hospitals can identify high-risk individuals based on key health indicators like BMI, blood sugar levels, and lifestyle factors. This allows for early intervention strategies, personalized treatment plans, and improved healthcare outcomes without needing explicit disease labels upfront.

Overall, unsupervised clustering on single-class data provides a powerful tool for identifying hidden subgroups, anomalies, and meaningful segmentation across industries. Whether in fraud detection, recruitment, customer analytics, or healthcare, this technique enables organizations to make data-driven decisions, uncover latent patterns, and optimize their processes without relying on pre-labeled datasets.

Brief Overview of the Code

1. Loading Libraries and Preparing Data

Loads the dataset (data_OneClass.csv) and drops the Class column: every observation carries the same label, so the column is uninformative and the remaining features are treated as unlabeled data.

library(tidyverse)     # Data wrangling & visualization (dplyr, ggplot2)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)  # Random Forest model & feature importance
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin
library(cluster)       # PAM clustering

mydata <- read.csv('data_OneClass.csv')
mydata$Class <- NULL
head(mydata)
          X1         X2         X3
1 -0.8881322 -1.3058263 -0.2718153
2 -0.3560633 -1.6036864 -1.5013413
3 -0.3960473 -1.0369018 -0.8749633
4 -0.1915692 -0.6358211 -0.7904328
5 -0.3035171 -1.3127854 -1.9480131
6 -0.7959268 -1.2413659 -1.5574404
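
The original CSV is not included here. Purely for readers who want to run the code without it, the snippet below builds a hypothetical stand-in with the same shape (rows of a single labeled class, three numeric features X1, X2, X3). It is not the original data, and the means and spreads are invented for illustration only.

# Hypothetical stand-in (NOT the original data): two latent subgroups of a
# single labeled class, three numeric features X1, X2, X3.
set.seed(123)
mydata <- rbind(
  data.frame(X1 = rnorm(50, -0.5, 0.4), X2 = rnorm(50, -1.2, 0.4), X3 = rnorm(50, -1.0, 0.5)),
  data.frame(X1 = rnorm(50,  0.8, 0.4), X2 = rnorm(50,  0.5, 0.4), X3 = rnorm(50,  0.6, 0.5))
)
head(mydata)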

2. Building an Unsupervised Random Forest Model

  • Trains a Random Forest model using only the first three features (X1, X2, X3).

  • The Random Forest’s proximity matrix (the fraction of trees in which two observations land in the same terminal node, i.e. a measure of their similarity) will later be used for clustering; a rough sketch of what the unsupervised forest does internally follows the output below.

  • Since no labels are provided, this is an unsupervised Random Forest model.

rf2 <- randomForest(x = mydata[,1:3])
rf2

Call:
 randomForest(x = mydata[, 1:3]) 
               Type of random forest: unsupervised
                     Number of trees: 500
No. of variables tried at each split: 1
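
For intuition: when randomForest() is called without a response, it creates a synthetic second class by scrambling each feature independently (which destroys the joint structure), then trains an ordinary classification forest to separate real from synthetic rows; proximities are recorded for the real observations. The snippet below is only a rough sketch of this idea under that description, not the package's exact sampling scheme.

# Sketch of the unsupervised trick: real rows vs. an independently scrambled copy
set.seed(1)
synthetic <- as.data.frame(lapply(mydata[, 1:3], sample))  # permute each column independently
x_all <- rbind(mydata[, 1:3], synthetic)
y_all <- factor(rep(c("real", "synthetic"), each = nrow(mydata)))

rf_manual <- randomForest(x = x_all, y = y_all, proximity = TRUE)
prox_manual <- rf_manual$proximity[1:nrow(mydata), 1:nrow(mydata)]  # keep real-vs-real proximities
dim(prox_manual)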

3. Generating the Proximity Matrix & Clustering

  • Extracts the proximity matrix from the Random Forest model, which measures how similar each observation is to others.

  • Applies PAM clustering (Partitioning Around Medoids) with k = 2 clusters.

  • Splits what was originally a single-class dataset into two distinct clusters (PAM labels them 1 and 2; they are relabeled 0 and 1 in the next step).

prox <- rf2$proximity
pam.rf <- pam(prox, 2)
table(pam.rf$clustering)

 1  2 
58 42 
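
Note that pam(prox, 2) treats each row of the proximity matrix as a feature vector. Since proximity is a similarity (1 means two observations always share a terminal node), another common option, sketched below under that assumption, is to convert it to a dissimilarity and hand that to pam() directly; with clearly separated subgroups the two labelings typically agree.

# Alternative sketch: cluster on 1 - proximity treated as a dissimilarity
dissim <- as.dist(1 - prox)                    # proximity of 1 -> distance of 0
pam.diss <- pam(dissim, k = 2, diss = TRUE)
table(pam.diss$clustering)
pam.diss$silinfo$avg.width                     # average silhouette width as a quick sanity check
table(pam.rf$clustering, pam.diss$clustering)  # cross-tabulate the two labelings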

4. Assigning New Cluster Labels

Converts the cluster assignments to a factor and relabels the levels as 0 and 1 for visualization.

pam.rf$clustering <- as.factor(pam.rf$clustering)
levels(pam.rf$clustering) <- c("0", "1")

5. Visualizing Clusters

Plotting X1 against X2, with points colored by cluster assignment, shows how cleanly the two discovered clusters separate.

# Visualize
ggplot(mydata, aes(x = X1, y = X2, col = pam.rf$clustering)) +
  geom_point(size = 5) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(caption = "Saurabh's Work",
       title = "Clustering from Unsupervised Random Forest",
       color = "Cluster")
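
A slightly more self-contained variant, shown here only as a sketch, is to store the cluster labels as a column of the data frame so that ggplot2 maps a variable of the data itself rather than an external vector; the same pattern then works for any feature pair, e.g. X1 vs X3.

# Attach the labels to the data frame and map them as a regular column
mydata$cluster <- pam.rf$clustering

ggplot(mydata, aes(x = X1, y = X2, color = cluster)) +
  geom_point(size = 3) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Clusters recovered from the RF proximity matrix",
       color = "Cluster")

# Same idea for another feature pair
ggplot(mydata, aes(x = X1, y = X3, color = cluster)) +
  geom_point(size = 3) +
  labs(color = "Cluster")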

6. Feature Importance Analysis

  • Extracts feature importance scores from the Random Forest model.

  • Plots variable importance to show which features (X1, X2, X3) carry the most structure in the unsupervised forest; here all three are nearly equal (a follow-up check on the discovered clusters appears after the output below).

importance(rf2)
   MeanDecreaseGini
X1         32.81214
X2         32.03506
X3         32.85512
varImpPlot(rf2)
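
One caveat: the importance scores above come from the unsupervised forest, which was trained to separate the real data from a synthetic copy, so they reflect overall structure rather than the split we just discovered. A possible follow-up, sketched below, is to refit a supervised forest on the PAM cluster labels and read off which features drive that particular separation.

# Follow-up sketch: which features separate the two discovered clusters?
rf_clusters <- randomForest(x = mydata[, 1:3], y = pam.rf$clustering,
                            importance = TRUE)
importance(rf_clusters)    # per-class importance plus MeanDecreaseAccuracy / MeanDecreaseGini
varImpPlot(rf_clusters)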

Conclusion

The analysis successfully restructured the single-class dataset into two distinct subgroups using Random Forest proximity measures and PAM clustering. The results reveal that even in seemingly homogeneous data, inherent patterns and separations can exist, which can be effectively detected using unsupervised learning techniques.