Uncovering Hidden Subgroups in Single-Class Data Using Unsupervised Learning
Author
Saurabh C Srivastava
Published
February 21, 2025
Objective of the Analysis
The goal of this analysis is to identify hidden subgroups within a single-class dataset using an unsupervised learning approach. Because the dataset carries only one class label, a supervised model has nothing to separate. We therefore combine Random Forest-based proximity analysis with Partitioning Around Medoids (PAM) clustering to split the data into two groups, which recasts the problem as a two-label one. The proximity matrix from the Random Forest measures how often pairs of observations fall in the same terminal node, and clustering on it can uncover natural patterns and potential subgroups that support better segmentation and classification in future applications.
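The approach described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the analysis code itself: the `toy` data frame and its columns `x1`, `x2`, `x3` are hypothetical stand-ins for the real dataset, and the choice of 500 trees is arbitrary. It relies on `randomForest()` running in unsupervised mode when no response is supplied, and on `pam()` from the `cluster` package accepting a dissimilarity object.

```r
library(randomForest)  # unsupervised Random Forest and proximity matrix
library(cluster)       # pam() for Partitioning Around Medoids

set.seed(42)

# Hypothetical stand-in for a single-class dataset: 100 unlabeled rows
# drawn from two groups that the analyst is assumed not to know about.
n <- 50
toy <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 4)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 4)),
  x3 = rnorm(2 * n)  # pure noise feature
)

# With no response supplied, randomForest() runs in unsupervised mode and,
# with proximity = TRUE, records how often each pair of rows lands in the
# same terminal node.
rf_unsup <- randomForest(x = toy, ntree = 500, proximity = TRUE)

# Proximity is a similarity in [0, 1]; convert it to a dissimilarity for PAM.
rf_dist <- as.dist(1 - rf_unsup$proximity)

# Partition the rows into two groups around medoids.
pam_fit <- pam(rf_dist, k = 2, diss = TRUE)

# Cluster sizes
table(pam_fit$clustering)
```

PAM is used on the dissimilarity `1 - proximity` rather than on the raw features, so the grouping reflects how the forest sees the data rather than plain Euclidean distance.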
Practical Applications of Identifying Hidden Subgroups in Single-Class Data
The approach of using unsupervised clustering on single-class data can be applied across various domains where the goal is to uncover hidden patterns, anomalies, or subgroups within seemingly uniform datasets. This method is particularly useful when explicit labels for classification are unavailable but meaningful differences exist within the data.
In credit card fraud detection, banks often maintain records of transactions treated as legitimate, with no explicitly labeled fraudulent cases. Some of these transactions may nevertheless be fraudulent and simply undetected. Random Forest proximity analysis combined with clustering can group transactions into distinct clusters and isolate those that deviate from normal spending patterns. By examining the importance of features such as transaction amount, merchant type, location, and time of day, financial institutions can flag suspicious transactions for manual review, improving fraud detection without needing pre-labeled fraud cases.
Similarly, in recruitment and candidate screening, companies hire employees based on multiple factors, but over time, some candidates excel while others underperform. Without explicit labels distinguishing successful vs. unsuccessful hires, clustering can help segment employees into strong and weak performers based on skills, education, experience, and job performance metrics. This insight allows organizations to refine their hiring process, prioritize key attributes contributing to success, and improve overall workforce quality.
In customer segmentation, businesses often treat all customers the same, even though some contribute significantly more revenue than others. Unsupervised clustering can separate customers into high-value and low-value segments, helping companies understand spending behaviors, product preferences, and purchasing frequency. Identifying these differences lets businesses develop personalized marketing campaigns, offer loyalty rewards to engaged customers, and re-engage inactive ones, which can support higher sales and better customer retention.
The healthcare industry also benefits from this technique, particularly in early disease detection and preventive care. Hospitals collect vast amounts of patient data, but in many cases, there are no pre-diagnosed labels for conditions such as diabetes or hypertension. By clustering patient records, hospitals can identify high-risk individuals based on key health indicators like BMI, blood sugar levels, and lifestyle factors. This allows for early intervention strategies, personalized treatment plans, and improved healthcare outcomes without needing explicit disease labels upfront.
Overall, unsupervised clustering on single-class data provides a powerful tool for identifying hidden subgroups, anomalies, and meaningful segmentation across industries. Whether in fraud detection, recruitment, customer analytics, or healthcare, this technique enables organizations to make data-driven decisions, uncover latent patterns, and optimize their processes without relying on pre-labeled datasets.
Brief Overview of the Code
1. Loading Libraries and Preparing Data
This step loads the required libraries, reads the dataset (data_OneClass.csv), and removes the Class column. Every data point originally belongs to the same class, so this column carries no information for clustering.
library(tidyverse)    # Data visualization (ggplot2)
library(randomForest) # Random Forest model & feature importance
Package versions reported at load time: tidyverse 2.0.0 (dplyr 1.1.4, readr 2.1.5, forcats 1.0.0, stringr 1.5.1, ggplot2 3.5.1, tibble 3.2.1, lubridate 1.9.4, tidyr 1.3.1, purrr 1.0.2) and randomForest 4.7-1.2. Loading them produces the usual masking notes: dplyr::filter() and dplyr::lag() mask their stats counterparts, and randomForest masks combine() from dplyr and margin() from ggplot2.
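The data-loading step described above might look like the following sketch. The file name `data_OneClass.csv` and the `Class` column come from the text; the feature columns `V1` and `V2` and their values are hypothetical, and a small temporary file is written only so that the example runs on its own.

```r
library(readr)  # read_csv(), write_csv()
library(dplyr)  # select(), n_distinct()

# The real file is data_OneClass.csv. Its feature columns are not shown in
# the text, so a tiny hypothetical file is created here as a placeholder.
csv_path <- file.path(tempdir(), "data_OneClass.csv")
write_csv(
  data.frame(V1 = c(0.1, 0.4, 3.9),
             V2 = c(1.2, 0.8, 4.1),
             Class = c(1, 1, 1)),
  csv_path
)

data_raw <- read_csv(csv_path, show_col_types = FALSE)

# Sanity check: a single-class dataset has exactly one distinct label.
stopifnot(n_distinct(data_raw$Class) == 1)

# Drop the constant label so that only the features go into the forest.
data_features <- select(data_raw, -Class)
names(data_features)
```

Checking that `Class` really is constant before dropping it guards against silently discarding a label that does carry information.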
The analysis successfully restructured the single-class dataset into two distinct subgroups using Random Forest proximity measures and PAM clustering. The results reveal that even in seemingly homogeneous data, inherent patterns and separations can exist, which can be effectively detected using unsupervised learning techniques.
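One way to check how well two such subgroups separate, and which features drive the split, is sketched below. This is an illustration on hypothetical toy data (the `toy` data frame, its columns, and the labels `Group1` and `Group2` are assumptions), not the results of the analysis above. It uses the average silhouette width that `pam()` reports and then treats the cluster assignments as a new two-level label for a supervised Random Forest.

```r
library(randomForest)
library(cluster)

set.seed(123)

# Hypothetical unlabeled data with two latent groups (not the source dataset).
n <- 50
toy <- data.frame(
  x1    = c(rnorm(n, mean = 0), rnorm(n, mean = 4)),
  x2    = c(rnorm(n, mean = 0), rnorm(n, mean = 4)),
  noise = rnorm(2 * n)
)

# Step 1: unsupervised forest -> proximity -> PAM with k = 2.
rf_unsup <- randomForest(x = toy, ntree = 500, proximity = TRUE)
pam_fit  <- pam(as.dist(1 - rf_unsup$proximity), k = 2, diss = TRUE)

# Average silhouette width: values closer to 1 suggest better-separated groups.
avg_sil <- pam_fit$silinfo$avg.width

# Step 2: use the cluster ids as a new two-level label.
new_label <- factor(pam_fit$clustering, labels = c("Group1", "Group2"))
rf_sup <- randomForest(x = toy, y = new_label, ntree = 500, importance = TRUE)

# Step 3: rank features by mean decrease in Gini impurity (type = 2).
imp <- importance(rf_sup, type = 2)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Because the supervised forest is trained on labels that the clustering itself produced, its accuracy says little about real-world validity; the importance ranking is best read as a description of what distinguishes the two clusters.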