Spring 2025 MS Student Updates
Overview
Density-based
Functions
db_clust() specification
Engine:
Currently uses DBSCAN.
I’m not aware of other candidate engines.
Parameters:
Two exposed params: radius (eps in engine) and min_pts (MinPts in engine).
Functions in dials:
radius: default values between -2 and 2, with a base-10 log transformation.
min_pts: default values between 3 and 50.
Changes to algorithm:
Major change: prediction uses only core points, so that prediction output and training output match.
Minor change: points in “overlap zone” are assigned to closest cluster, not first cluster.
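The two changes above can be illustrated with a minimal sketch. It assumes that "closest cluster" means the cluster of the nearest core point within the radius; the function name predict_dbscan and the toy data are hypothetical and are not taken from the db_clust() implementation.

```python
import math

def predict_dbscan(new_point, core_points, core_labels, radius):
    """Label a new point using core points only (assumed reading of the notes).

    The point takes the label of the nearest core point that lies within
    `radius`; if no core point is that close, it is reported as "Outlier".
    """
    best_label, best_dist = "Outlier", math.inf
    for point, label in zip(core_points, core_labels):
        d = math.dist(new_point, point)
        # Keep the closest qualifying core point, not the first one found.
        if d <= radius and d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Hypothetical core points from two fitted clusters.
core_points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.5, 0.0)]
core_labels = ["Cluster_1", "Cluster_1", "Cluster_2", "Cluster_2"]

# (1.9, 0.0) is within radius 1.5 of core points from both clusters.
overlap_label = predict_dbscan((1.9, 0.0), core_points, core_labels, radius=1.5)
far_label = predict_dbscan((10.0, 10.0), core_points, core_labels, radius=1.5)
```

In this toy example the overlap point is 1.4 away from a Cluster_1 core point and 1.1 away from a Cluster_2 core point, so it is labelled Cluster_2, whereas a "first cluster" rule would have returned Cluster_1.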
Changes to tidyclust framework:
Outliers are now possible. These are denoted by Outlier instead of Cluster_N. In extract_cluster_summary(), the cluster center for the outlier group is deliberately returned as NA, to prevent users from treating the collection of outliers as a cluster.
We should consider whether we like this notation. Another option is Cluster_0 (to preserve the “Cluster” name so that other functions keep working) or even Cluster_0_1, Cluster_0_2, etc., to treat each outlier as its own “cluster of one” for partition purposes.
In principle, overlap points, with membership in multiple clusters, are now possible. This is not implemented as of now (these points are assigned to the nearest cluster), but they could be incorporated if a mode besides partition were added.
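As a rough illustration of the outlier convention described above, the sketch below computes one center per cluster label and deliberately leaves the outlier group without one. The function name cluster_centers is hypothetical, and Python's None stands in for NA; this is not the extract_cluster_summary() code.

```python
from collections import defaultdict

def cluster_centers(points, labels, outlier_label="Outlier"):
    """Return a dict of label -> centroid, with None for the outlier group."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    centers = {}
    for label, members in groups.items():
        if label == outlier_label:
            # Outliers are not a cohesive set, so no center is reported.
            centers[label] = None
        else:
            centers[label] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers

# Hypothetical fitted labels: two clustered points and two outliers.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (-5.0, 7.0)]
labels = ["Cluster_1", "Cluster_1", "Outlier", "Outlier"]
centers = cluster_centers(points, labels)
```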
Mixture Models
Functions
gm_clust() specification
Engine
Currently uses mclust.
Other candidates: Rmixmod, mixture, EMCluster, mixtools.
Parameters
Six exposed params: num_clusters, and five boolean params that control the multivariate Gaussian mixture model specification:
circular
zero_covariance
shared_orientation
shared_shape
shared_size
The five boolean inputs replace the mclust style of specifying a model by “name”, e.g. EII, EVE, etc. This is a big API change, but one that allows the function to be more intuitive and self-documenting.
In all cases, values of TRUE indicate a more restricted model. The defaults are all TRUE, which means that only two of the parameters, circular and shared_size, are actually used by default.
In dials:
num_clusters() was already present and is unchanged.
The five boolean inputs give c(TRUE, FALSE) for tuning, as expected.
(Note that because not every combination of boolean parameters is a viable model, tuning an exhaustive grid results in repeating some of the model types.)
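The repetition noted above can be made concrete with a small sketch. The mapping below is an assumption based on mclust's three-letter naming convention (volume, shape, orientation, with E for equal, V for varying, and I for identity); it is not necessarily the logic used inside gm_clust(), and the function name mclust_model_name is hypothetical.

```python
from itertools import product

def mclust_model_name(circular=True, zero_covariance=True,
                      shared_orientation=True, shared_shape=True,
                      shared_size=True):
    """Map the five boolean flags to an mclust-style model name (assumed mapping)."""
    volume = "E" if shared_size else "V"
    if circular:
        # Spherical clusters: shape and orientation flags are irrelevant.
        return volume + "II"
    shape = "E" if shared_shape else "V"
    if zero_covariance:
        # Axis-aligned (diagonal) clusters: orientation flag is irrelevant.
        return volume + shape + "I"
    orientation = "E" if shared_orientation else "V"
    return volume + shape + orientation

# Exhaustive grid over the five booleans, in the same order as the arguments.
grid = list(product([True, False], repeat=5))
names = [mclust_model_name(*flags) for flags in grid]

print(len(grid), len(set(names)))  # 32 14
```

Under this assumed mapping, the 32 grid points collapse to 14 distinct model types, the all-TRUE default corresponds to EII, and only circular and shared_size change the result while circular is TRUE.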
Changes to algorithm:
The Mclust() function in mclust fits all supplied MV Gaussian model types.
The gm_clust() specification fits only one Gaussian model, and expects the tuning framework to be used to fit several.
BIRCH
Engine
Currently: BIRCH.
No others we know of.
Parameters
radius_threshold: The maximum radius allowed for a subcluster within a leaf node of the CF-tree. (Default: 0.5)
branching_factor: The maximum number of child nodes in a non-leaf node of the CF-tree. (Default: 50)
max_leaf: The maximum number of microclusters or CF entries allowed in a leaf node. (Default: 100)
global_method: An additional clustering to be applied to the microclusters produced. Currently only “hier_clust” and “k_means” are available. (Default: “none”)
num_clusters: Optional; needed if global_method = "k_means".
cut_height: Optional; needed if global_method = "hier_clust" and num_clusters is not supplied.
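The dependencies between global_method, num_clusters, and cut_height can be summarised as a small validation sketch. The function name check_birch_args and its return strings are hypothetical; they only restate the rules listed above and are not part of the BIRCH engine or the tidyclust specification.

```python
def check_birch_args(global_method="none", num_clusters=None, cut_height=None):
    """Validate the optional arguments and describe the resulting second stage."""
    if global_method == "none":
        return "microclusters only"
    if global_method == "k_means":
        if num_clusters is None:
            raise ValueError('num_clusters is needed if global_method = "k_means"')
        return f"k_means on microclusters, num_clusters={num_clusters}"
    if global_method == "hier_clust":
        if num_clusters is not None:
            return f"hier_clust on microclusters, num_clusters={num_clusters}"
        if cut_height is not None:
            return f"hier_clust on microclusters, cut_height={cut_height}"
        raise ValueError(
            'cut_height is needed if global_method = "hier_clust" '
            "and num_clusters is not supplied"
        )
    raise ValueError(f"unknown global_method: {global_method!r}")

default_plan = check_birch_args()
hier_plan = check_birch_args(global_method="hier_clust", cut_height=2.5)
```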
Changes to tidyclust framework:
None; however, the microcluster-to-additional-clustering flow is probably worth discussing more broadly.
Frequent Itemset Mining
Engine
Currently: arules
Parameters
Two exposed parameters:
min_support: Between 0 and 1, the support needed for an itemset to be considered frequent.
mining_method: Either “arules” or “eclat”, two distinct algorithms for searching for itemsets.
Changes to algorithm:
No changes to the engine fitting process.
However, “partition” style clustering on the columns requires each item to only be in one set. A new procedure has been implemented to select non-overlapping itemsets, prioritizing larger size and then higher support.
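One way to read the selection procedure is as a greedy pass over the candidate itemsets, sketched below. The greedy strategy, the function name select_non_overlapping, and the example itemsets are assumptions for illustration; the notes only state that larger size is prioritised first and higher support second.

```python
def select_non_overlapping(itemsets):
    """Greedily pick itemsets so that each item appears in at most one set.

    `itemsets` is a list of (set_of_items, support) pairs. Candidates are
    ranked by size (largest first), then by support (highest first).
    """
    ranked = sorted(itemsets, key=lambda pair: (-len(pair[0]), -pair[1]))
    used_items = set()
    chosen = []
    for items, support in ranked:
        # Skip any itemset that shares an item with one already chosen.
        if used_items.isdisjoint(items):
            chosen.append((items, support))
            used_items |= set(items)
    return chosen

# Hypothetical frequent itemsets with their supports.
candidates = [
    ({"bread", "eggs", "milk"}, 0.20),
    ({"bread", "eggs"}, 0.35),
    ({"beer", "chips"}, 0.30),
    ({"chips", "salsa"}, 0.25),
    ({"salsa"}, 0.40),
]
selected = select_non_overlapping(candidates)
```

In this toy example the three-item set wins over its higher-support subset, {beer, chips} beats {chips, salsa} on support, and salsa ends up in a set of its own.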
Changes to tidyclust framework:
Many significant changes. This should be discussed in-depth for all semi-supervised and/or “column clustering” methods.
Questions/Discussion Items
Outlier handling
At first I wanted the convention of naming these Cluster_0_1, etc. This way they are all denoted as Cluster_0 outliers, but the additional numbering keeps them from being naively grouped, as these should not be treated as a cohesive set.
However, the student implementing DBSCAN pointed out that this results in a huge number of levels in the cluster names variable, whereas Outlier is cleaner.
Other modes besides partition
I’m on the fence about whether outliers alone necessitate a new mode. On the one hand, they can be easily reported in the partition framework. On the other, it’s misleading to suggest we will supply a partition and then not return that.
Overlap points will need a new mode, and I think this will be important to do at some point, as many methods we want to implement eventually will allow overlap. I’m inclined to call this extraction or detection. This would allow each point to be in multiple clusters, or in zero.
Column clustering, such as in frequent itemset mining, is a whole can of worms. The structure we have chosen for the return values of extract_cluster_summary() and predict() is not compatible with the current tidyclust design, and it would definitely need a new mode to be incorporated.
The rabbit hole of semi-supervised learning
This might be more easily discussed “live”. The gist in the case of Frequent Itemset Mining is:
The “clusters” returned are sets of items (columns) rather than sets of rows (transaction receipts). These are discovered in an unsupervised way.
Prediction becomes weird, because typically we aren’t receiving a new observation (row) to predict on; we are receiving a partially complete observation to fill in (e.g. “customer bought bread and eggs, will they buy milk?”).
Prediction also becomes weird because this structure has a supervised version, where we know if the customer bought milk or not, and we can assess model performance.