2025-06-18

K-Means Clustering

K-means clustering is a widely used machine learning algorithm that groups data points into k distinct, non-overlapping clusters. The goal is to partition the data so that the points within each cluster are as similar as possible while the clusters themselves are as different from one another as possible. It achieves this by repeating two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it. Each iteration yields centroids that fit the data at least as well as the previous ones. After enough iterations the assignments stop changing, the centroids converge, and the result is the final set of clusters.
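The assign-then-update loop described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: simple_kmeans is a hypothetical helper written for this example (it is not part of any package), it starts from randomly chosen data points, and the toy data is invented.

```r
# Minimal sketch of the k-means loop; simple_kmeans is a hypothetical helper.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen data points as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every centroid.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest centroid
    # Stop once no point changes cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: each centroid becomes the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Invented toy data: two well-separated blobs of 20 points each.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

In practice the built-in kmeans function is preferable; the sketch only makes the two steps visible.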

K-means clustering is helpful when you want to discover groups in a set of data points. For example, let’s say you have a dataset from a pet store with the following features: ‘Purchase Date’, ‘Purchase Amount’, and ‘Gender’. Applying k-means clustering to these features splits the customers into groups with similar purchasing behavior, and those groups might turn out to correspond to, say, dog owners and cat owners.

Sales Dataset

K-means clustering will be applied to the following synthetic dataset, which consists of 76,000 rows and 16 columns:

  • Date: Date of the record.
  • Store ID: Unique identifier for the store.
  • Product ID: Unique identifier for the product.
  • Category: Product category.
  • Region: Geographical region of the store.
  • Inventory Level: Units available in stock.
  • Units Sold: Units sold on that day.
  • Units Ordered: Units ordered for restocking.
  • Price: Product price.
  • Discount: Discount applied, if any.
  • Weather Condition: Weather on the day of the record.
  • Promotion: 1 if there was a promotion, 0 otherwise.
  • Competitor Pricing: Price of a similar product from a competitor.
  • Seasonality: Season (e.g., Winter, Spring).
  • Epidemic: 1 if an epidemic occurred, 0 otherwise.
  • Demand: Daily estimated demand for the product.

Sales Dataset Cont.

The rows in this dataset are organized by date, store ID, and product ID. For each day and each store, there is a record for each product, containing information such as how many units of that product were sold and how many units were in inventory.

K-Means Clustering to Determine Category

The dataset already contains a column called ‘Category’, which gives the category of each product. I will pretend that no such column exists and use k-means clustering to determine the category of each product. The features used will be ‘Units Sold’ and ‘Price’. I chose these features because there is a clear relationship between the two. To show this, I will plot these values for products P0001, P0002, P0005, P0006, and P0010 from store S001.

Note: I chose these products because each one belongs to a different category, which produces a plot that is simple and easy to read.

Units Sold vs Price

Units Sold vs Price Without Knowing Category
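Plots like the two titled above could be produced with base R graphics along the following lines. This is a sketch, not the code behind the original figures: the data frame is an invented stand-in for the real dataset, the category names are placeholders, and the column names (Units.Sold, Product.ID, Store.ID) assume the headers were read with read.csv, which replaces spaces with dots.

```r
# Invented stand-in for the sales data; every value here is made up.
set.seed(42)
products <- c("P0001", "P0002", "P0005", "P0006", "P0010")
sales <- data.frame(
  Store.ID   = "S001",
  Product.ID = rep(products, each = 30),
  Category   = rep(paste("Category", LETTERS[1:5]), each = 30),  # placeholder names
  Price      = rnorm(150, mean = rep(c(10, 25, 40, 55, 70), each = 30), sd = 2),
  Units.Sold = rnorm(150, mean = rep(c(220, 180, 140, 100, 60), each = 30), sd = 8)
)

# Keep only the five chosen products from store S001.
sel <- sales[sales$Store.ID == "S001" & sales$Product.ID %in% products, ]

# First plot: points colored by their known category.
cat_f <- factor(sel$Category)
plot(sel$Price, sel$Units.Sold, col = as.integer(cat_f), pch = 19,
     xlab = "Price", ylab = "Units Sold", main = "Units Sold vs Price")
legend("topright", legend = levels(cat_f),
       col = seq_along(levels(cat_f)), pch = 19)

# Second plot: the same points with the category hidden.
plot(sel$Price, sel$Units.Sold, pch = 19,
     xlab = "Price", ylab = "Units Sold",
     main = "Units Sold vs Price Without Knowing Category")
```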

Applying K-Means Clustering

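R makes it easy to apply k-means clustering using functions that ship with the standard distribution, so no additional package is required. First, the features are standardized with the function named scale, which returns a numeric matrix in which each column has mean 0 and standard deviation 1. Then a seed is set with set.seed, which makes the results reproducible because kmeans picks its starting centroids at random. Next, the k-means algorithm is applied via the kmeans function, and the resulting cluster assignments can then be appended to the original dataset.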
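A minimal sketch of those steps follows. The data frame is an invented stand-in for the real dataset, the column names assume read.csv-style headers (Units.Sold rather than ‘Units Sold’), and the choices of k = 5, seed 123, and nstart = 25 are assumptions made for illustration rather than settings taken from the original analysis.

```r
# Invented stand-in for the five selected products; every value is made up.
set.seed(42)
sales <- data.frame(
  Price      = rnorm(150, mean = rep(c(10, 25, 40, 55, 70), each = 30), sd = 2),
  Units.Sold = rnorm(150, mean = rep(c(220, 180, 140, 100, 60), each = 30), sd = 8)
)

# Step 1: standardize the two features so neither dominates the distance.
features <- scale(sales[, c("Units.Sold", "Price")])

# Step 2: fix the seed so the random starting centroids are reproducible.
set.seed(123)

# Step 3: run k-means; nstart = 25 tries 25 random starts and keeps the best fit.
fit <- kmeans(features, centers = 5, nstart = 25)

# Step 4: append the cluster assignment to the original data.
sales$Cluster <- factor(fit$cluster)
print(table(sales$Cluster))
```

Scaling matters here because ‘Units Sold’ and ‘Price’ are measured in different units; without it, the feature with the larger numeric range would drive the distance calculation.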

Note: The two features, ‘Units Sold’ and ‘Price’, will be used to determine the category of each product. You will have the option to use additional features and to choose the number of clusters (the k value) to achieve different results.
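One common way to compare candidate k values, which the original analysis does not claim to use, is to run kmeans for several values of k and look at how the total within-cluster sum of squares falls. The sketch below does this on invented data; the range 1 to 8 and nstart = 25 are arbitrary choices for illustration.

```r
# Invented stand-in data with five underlying groups; every value is made up.
set.seed(42)
sales <- data.frame(
  Price      = rnorm(150, mean = rep(c(10, 25, 40, 55, 70), each = 30), sd = 2),
  Units.Sold = rnorm(150, mean = rep(c(220, 180, 140, 100, 60), each = 30), sd = 8)
)
features <- scale(sales[, c("Units.Sold", "Price")])

# Fit k-means for k = 1..8 and record the total within-cluster sum of squares.
set.seed(123)
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# Look for the k beyond which adding clusters stops paying off (the "elbow").
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Additional features can be tried the same way by adding their columns to the matrix passed to scale.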

Another Note: While k-means clustering can identify distinct clusters, it cannot determine what each cluster actually represents. The cluster numbers it returns are arbitrary, so assigning meaningful labels to the clusters is up to the person performing the analysis.
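Because this dataset happens to include the real ‘Category’ column, one possible way to label the clusters is to cross-tabulate the cluster numbers against it and give each cluster the name of its most frequent category. The sketch below uses invented data with placeholder category names; the majority-vote rule is an assumption for illustration, not a step described in the original analysis.

```r
# Invented stand-in data; category names are placeholders and values are made up.
set.seed(42)
sales <- data.frame(
  Category   = rep(paste("Category", LETTERS[1:5]), each = 30),
  Price      = rnorm(150, mean = rep(c(10, 25, 40, 55, 70), each = 30), sd = 2),
  Units.Sold = rnorm(150, mean = rep(c(220, 180, 140, 100, 60), each = 30), sd = 8)
)

features <- scale(sales[, c("Units.Sold", "Price")])
set.seed(123)
fit <- kmeans(features, centers = 5, nstart = 25)

# Rows are the arbitrary cluster numbers, columns are the known categories.
crosstab <- table(Cluster = fit$cluster, Category = sales$Category)
print(crosstab)

# Name each cluster after the category that appears in it most often.
majority <- colnames(crosstab)[apply(crosstab, 1, which.max)]
sales$Label <- majority[fit$cluster]
```

When no ground-truth column exists, the same table cannot be built, and the labels have to come from inspecting the cluster centers and domain knowledge instead.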

Thanks!