Spring 2025 MS Student Updates

Overview

Density-based

Functions

  • db_clust() specification

Engine

Currently uses DBSCAN.

I’m not aware of other candidate engines.

Parameters

Two exposed params: radius (eps in engine) and min_pts (MinPts in engine).

Functions in dials:

  • radius: default range -2 to 2 on the log10 scale, i.e. 0.01 to 100 after back-transformation.
  • min_pts: default range 3 to 50.
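
To make the log10 range concrete, here is a small Python sketch of how evenly spaced values on the transformed scale map back to radius values. It is an illustration only: dials is an R package, and radius_grid is a hypothetical helper, not a dials function.

```python
def radius_grid(levels, low=-2.0, high=2.0):
    """Return `levels` radius values evenly spaced on the log10 scale."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    step = (high - low) / (levels - 1)
    # Exponents are evenly spaced; back-transform each with 10 ** exponent.
    return [10.0 ** (low + i * step) for i in range(levels)]


# Five levels cover one order of magnitude per step.
grid = radius_grid(5)
```

A regular grid therefore spends as many candidates between 0.01 and 0.1 as between 10 and 100, which is the usual reason for tuning a distance threshold on a log scale.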

Changes to algorithm:

Major change: prediction uses only core points, so that prediction output and training output match.

Minor change: points in “overlap zone” are assigned to closest cluster, not first cluster.
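
The two changes above can be sketched as follows. This is a minimal Python illustration, not the package's R code: predict_dbscan and the example data are hypothetical, and treating "within radius" as less than or equal to is an assumption.

```python
import math


def predict_dbscan(point, core_points, core_labels, radius):
    """Assign `point` to the cluster of its nearest core point within `radius`.

    Only core points are consulted. Returns "Outlier" when no core point
    is within reach.
    """
    best_label, best_dist = "Outlier", math.inf
    for core, label in zip(core_points, core_labels):
        d = math.dist(point, core)
        if d <= radius and d < best_dist:
            best_label, best_dist = label, d
    return best_label


core_points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.5, 0.0)]
core_labels = ["Cluster_1", "Cluster_1", "Cluster_2", "Cluster_2"]

# (1.9, 0) lies within 1.5 of core points from both clusters (an overlap
# zone); the nearest core point, (3.0, 0), decides the assignment.
overlap = predict_dbscan((1.9, 0.0), core_points, core_labels, radius=1.5)
far = predict_dbscan((10.0, 10.0), core_points, core_labels, radius=1.5)
```

Because the rule depends only on distances to core points, it does not depend on the order in which clusters were discovered, which is what makes the "closest cluster" assignment reproducible.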

Changes to tidyclust framework:

Outliers are now possible. These are denoted by Outlier instead of Cluster_N. In extract_cluster_summary(), the cluster center for the outlier group is deliberately returned as NA, to prevent users from treating the collection of outliers as a cluster.

We should consider whether we like this notation. Another option is Cluster_0 (to preserve the “Cluster” name so that other functions keep working) or even Cluster_0_1, Cluster_0_2, etc. to treat each outlier as its own “cluster of one” for partition purposes.
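
The three naming options can be compared side by side with a short Python sketch. It assumes the engine reports noise points with the integer label 0; name_clusters and its style argument are hypothetical names used only for illustration.

```python
def name_clusters(engine_labels, style="Outlier"):
    """Convert integer engine labels (0 = noise, assumed) to cluster names.

    style is one of "Outlier", "Cluster_0", or "Cluster_0_k"
    (the last gives every outlier its own level).
    """
    names, outlier_count = [], 0
    for label in engine_labels:
        if label != 0:
            names.append(f"Cluster_{label}")
        elif style == "Outlier":
            names.append("Outlier")
        elif style == "Cluster_0":
            names.append("Cluster_0")
        elif style == "Cluster_0_k":
            outlier_count += 1
            names.append(f"Cluster_0_{outlier_count}")
        else:
            raise ValueError(f"unknown style: {style}")
    return names


labels = [1, 0, 2, 0, 1, 0]
as_outlier = name_clusters(labels, "Outlier")
as_singletons = name_clusters(labels, "Cluster_0_k")
```

With the singleton style, the number of distinct levels grows with the number of outliers, while the other two styles add exactly one level.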

In principle, overlap points could now be possible, with membership in multiple clusters. This is not implemented as of now - these points are assigned to the nearest cluster - but they could be incorporated if a mode besides partition were added.

Mixture Models

Functions

  • gm_clust() specification

Engine

Currently uses mclust.

Other candidates: Rmixmod, mixture, EMCluster, mixtools

Parameters

Six exposed params: num_clusters, plus five boolean params that control the multivariate Gaussian mixture model specification:

  • circular
  • zero_covariance
  • shared_orientation
  • shared_shape
  • shared_size

The five boolean inputs replace the mclust specification style by model “name”, e.g. EII, EVE etc. This is a big API change, but one that allows the function to be more intuitive and self-documenting.

In all cases, TRUE indicates a more restricted model. All five default to TRUE, which means that under the defaults only two parameters, circular and shared_size, have any effect.

In dials:

  • num_clusters() was already present and is unchanged.
  • The five boolean inputs give c(TRUE, FALSE) for tuning as expected.
  • (Note that because not every combination of boolean parameters is a viable model, tuning an exhaustive grid results in repeating some of the model types.)
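
One way to picture how the five flags relate to model names, and why an exhaustive grid repeats model types, is the Python sketch below. The mapping is an assumption based on mclust's three-letter convention (volume, shape, orientation; E = equal, V = variable, I = identity); model_name is a hypothetical helper, and the actual translation in gm_clust() may differ.

```python
from itertools import product


def model_name(circular=True, zero_covariance=True, shared_orientation=True,
               shared_shape=True, shared_size=True):
    """Translate the five flags into a three-letter mclust-style name (assumed mapping)."""
    size = "E" if shared_size else "V"
    if circular:
        # Spherical: shape and orientation are fixed, only size can vary.
        return size + "II"
    shape = "E" if shared_shape else "V"
    if zero_covariance:
        # Diagonal covariance: axis-aligned, so orientation is fixed.
        return size + shape + "I"
    orientation = "E" if shared_orientation else "V"
    return size + shape + orientation


default_model = model_name()

# Exhaustive grid over the five flags: 32 combinations, fewer distinct models.
grid = [model_name(*flags) for flags in product([True, False], repeat=5)]
distinct = sorted(set(grid))
```

Under this assumed mapping the 32 flag combinations collapse to 14 distinct names, so a tuning grid could be deduplicated on the resulting name before fitting.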

Changes to algorithm:

The Mclust() function in mclust fits all supplied multivariate Gaussian model types.
The gm_clust() specification fits only one Gaussian model and expects the tuning framework to be used to fit several.

BIRCH

Engine

Currently: BIRCH.

No others we know of.

Parameters

  • radius_threshold: The maximum radius allowed for a subcluster within a leaf node of the CF-tree. (Default 0.5)

  • branching_factor: The maximum number of child nodes in a non-leaf node of the CF-tree. (Default 50)

  • max_leaf: The maximum number of microclusters or CF entries allowed in a leaf node. (Default 100).

  • global_method: An additional clustering to be applied to the microclusters produced. Besides the default “none” (no additional clustering), only “hier_clust” and “k_means” are currently available.

  • num_clusters: Optional; required if global_method = "k_means".

  • cut_height: Optional; needed if global_method = "hier_clust" and num_clusters is not supplied.
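
To show what radius_threshold controls, here is a minimal Python sketch of a BIRCH clustering feature (count, linear sum, sum of squared norms) that absorbs a point only if the subcluster radius stays within the threshold. The class and method names are hypothetical, the radius is taken as the root-mean-square distance to the centroid, and the engine's exact rule (for example strict versus non-strict comparison) is an assumption.

```python
import math


class ClusteringFeature:
    """Summary of a subcluster: count, linear sum, and sum of squared norms."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = 0.0

    def radius_if_added(self, point):
        """Radius the subcluster would have after absorbing `point`."""
        n = self.n + 1
        ls = [s + x for s, x in zip(self.linear_sum, point)]
        ss = self.squared_sum + sum(x * x for x in point)
        centroid_sq = sum((s / n) ** 2 for s in ls)
        # Root-mean-square distance of members from the centroid;
        # max() guards against tiny negative values from rounding.
        return math.sqrt(max(ss / n - centroid_sq, 0.0))

    def try_absorb(self, point, radius_threshold=0.5):
        """Add `point` only if the resulting radius stays within the threshold."""
        if self.radius_if_added(point) > radius_threshold:
            return False
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.squared_sum += sum(x * x for x in point)
        return True


cf = ClusteringFeature(dim=2)
accepted = [cf.try_absorb(p) for p in [(0.0, 0.0), (0.4, 0.0), (5.0, 5.0)]]
```

A rejected point would start a new subcluster in the same leaf, which is where max_leaf and branching_factor come into play; that part of the tree logic is not shown here.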

Changes to tidyclust framework:

None; however, the microcluster-to-additional-clustering flow is probably worth discussing more broadly.

Frequent Itemset Mining

Engine

Currently: arules

Parameters

Two exposed parameters:

  • min_support: Between 0 and 1, the support needed for an itemset to be considered frequent.

  • mining_method: Either “apriori” or “eclat”, two distinct algorithms for searching for itemsets.
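
As a reminder of what min_support thresholds, the Python sketch below computes support as the fraction of transactions containing every item of an itemset. The support helper and the toy transactions are hypothetical and are not part of the engine.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"milk"},
    {"bread", "milk"},
]

min_support = 0.5
bread_eggs = support({"bread", "eggs"}, transactions)  # in 2 of 4 transactions
is_frequent = bread_eggs >= min_support
```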

Changes to algorithm:

No changes to the engine fitting process.

However, “partition” style clustering on the columns requires each item to only be in one set. A new procedure has been implemented to select non-overlapping itemsets, prioritizing larger size and then higher support.
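
A greedy version of such a selection could look like the Python sketch below. It is one possible reading of "prioritizing larger size and then higher support", not necessarily the implemented procedure; select_disjoint_itemsets and the example data are hypothetical.

```python
def select_disjoint_itemsets(itemsets):
    """Greedily pick non-overlapping itemsets.

    `itemsets` is a list of (items, support) pairs. Larger itemsets are
    considered first; ties in size are broken by higher support.
    """
    ranked = sorted(itemsets, key=lambda pair: (-len(pair[0]), -pair[1]))
    chosen, used = [], set()
    for items, supp in ranked:
        items = frozenset(items)
        # Keep an itemset only if none of its items is already taken.
        if used.isdisjoint(items):
            chosen.append((items, supp))
            used |= items
    return chosen


frequent = [
    ({"bread", "eggs"}, 0.50),
    ({"bread", "milk"}, 0.60),
    ({"eggs", "jam"}, 0.30),
    ({"bread"}, 0.75),
    ({"tea"}, 0.40),
]
picked = select_disjoint_itemsets(frequent)
```

Here {bread, milk} wins over {bread, eggs} on support, which frees eggs to join {eggs, jam}; items that end up in no selected itemset would still need a convention, such as a singleton set or an outlier-style label.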

Changes to tidyclust framework:

Many significant changes. This should be discussed in-depth for all semi-supervised and/or “column clustering” methods.

Questions/Discussion Items

Outlier handling

At first I wanted the convention of naming these Cluster_0_1 etc. This way they are all denoted as Cluster_0 outliers, but the additional numbering keeps them from being naively grouped, as these should not be treated as a cohesive set.

However, the student implementing DBSCAN pointed out that this results in a huge number of levels in the cluster names variable, whereas Outlier is cleaner.

Other modes besides partition

I’m on the fence about whether outliers alone necessitate a new mode. On the one hand, they can be easily reported in the partition framework. On the other, it’s misleading to suggest we will supply a partition and then not return one.

Overlap points will need a new mode, and I think this will be important to do at some point, as many methods we want to implement eventually will allow overlap. I’m inclined to call this extraction or detection. This would allow each point to be in multiple clusters, or in zero.

Column clustering, such as in frequent itemset mining, is a whole can of worms. The return structures we have chosen for extract_cluster_summary() and predict() are not compatible with the current tidyclust design and would definitely need a new mode to be incorporated.

The rabbit hole of semi-supervised learning

This might be more easily discussed “live”. The gist in the case of Frequent Itemset Mining is:

  • The “clusters” returned are sets of items (columns) rather than sets of rows (transaction receipts). These are discovered in an unsupervised way.

  • Prediction becomes weird, because typically we aren’t receiving a new observation (row) to predict on; we are receiving a partially complete observation to fill in (e.g. “customer bought bread and eggs, will they buy milk?”).

  • Prediction also becomes weird because this structure has a supervised version, where we know if the customer bought milk or not, and we can assess model performance.