Cramér’s V

Data

# Sample data: counts of preferences by gender
data <- matrix(c(
  20, 30, 25,  # Male: electronics, clothing, food
  15, 40, 20   # Female: electronics, clothing, food
), nrow = 2, byrow = TRUE)

# Add row and column names
rownames(data) <- c("Male", "Female")
colnames(data) <- c("Electronics", "Clothing", "Food")

# Show the contingency table
print(data)
       Electronics Clothing Food
Male            20       30   25
Female          15       40   20

Chi-square test

# Perform chi-square test
chisq_result <- chisq.test(data)

# View results
chisq_result

    Pearson's Chi-squared test

data:  data
X-squared = 2.6984, df = 2, p-value = 0.2594

Cramér’s V

# Install package if needed
#install.packages("lsr")  # only once
library(lsr)

# Calculate Cramér's V
cramersV(data)
[1] 0.1341246
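
For reference, Cramér’s V can be recomputed from the chi-square statistic as V = sqrt(χ² / (n · min(r − 1, c − 1))), where n is the total count and r and c are the numbers of rows and columns. The sketch below uses only base R on the same table; the names `tab`, `chi2`, `k`, and `v_manual` are illustrative and not part of the original analysis.

```r
# Rebuild the contingency table from the example above
tab <- matrix(c(20, 30, 25,
                15, 40, 20), nrow = 2, byrow = TRUE)

# Pearson chi-square statistic (same test as chisq.test(data) above)
chi2 <- unname(chisq.test(tab)$statistic)

n <- sum(tab)                         # total number of observations
k <- min(nrow(tab), ncol(tab)) - 1    # min(r - 1, c - 1)

# Cramér's V from its definition
v_manual <- sqrt(chi2 / (n * k))
v_manual
```

The result should agree with the value returned by cramersV(data) up to rounding.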

Interpretation

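The chi-square test gives a p-value that indicates whether the association is statistically significant. Here p = 0.2594, so there is no evidence of an association between gender and preference at the 5% level.

Cramér’s V measures the strength of the association on a scale from 0 to 1. Here V ≈ 0.13, which is weak by the usual rough benchmarks: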

~0.1 → weak

~0.3 → moderate

~0.5+ → strong

Cohen’s d

Cohen’s d is a measure of effect size used to indicate the standardized difference between two means. It tells you how large the difference is, in terms of standard deviations.

✅ Purpose of Cohen’s d

Quantifies the magnitude of the difference between two groups (e.g. treatment vs. control).

Often reported alongside t-tests in psychology, education, and medical research.

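\[ d = \frac{M_1 - M_2}{SD_{\text{pooled}}} \quad \text{where} \quad SD_{\text{pooled}} = \sqrt{\frac{SD_1^2 + SD_2^2}{2}} \]

This simple form of the pooled SD assumes equal group sizes; with unequal sizes, each variance is weighted by its degrees of freedom.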

# Sample data
group1 <- c(100, 102, 98, 95, 101)   # e.g. control
group2 <- c(110, 108, 112, 107, 111) # e.g. treatment

# Install effectsize package (if needed)
#install.packages("effectsize") 
library(effectsize)

# Compute Cohen's d
cohens_d(group2, group1)
Cohen's d |       95% CI
------------------------
4.25      | [1.83, 6.60]

- Estimated using pooled SD.
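
As a cross-check, the same value can be reproduced in base R from the formula above. This is a minimal sketch that assumes equal group sizes (as in this example); `sd_pooled` and `d_manual` are illustrative names.

```r
group1 <- c(100, 102, 98, 95, 101)   # control
group2 <- c(110, 108, 112, 107, 111) # treatment

# Pooled SD for equal group sizes: square root of the mean of the two variances
sd_pooled <- sqrt((sd(group1)^2 + sd(group2)^2) / 2)

# Standardized mean difference (treatment minus control)
d_manual <- (mean(group2) - mean(group1)) / sd_pooled
round(d_manual, 2)
```

The rounded value should match the 4.25 reported by cohens_d(group2, group1).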

Interpretation (per Cohen’s conventions)

Cohen’s d      Effect size
< 0.2          Negligible
0.2 – 0.5      Small
0.5 – 0.8      Medium
> 0.8          Large

These are general guidelines; context matters. By these conventions, the d of 4.25 in the toy example above is a very large effect.

Adjusted Rand Index

✅ Purpose of ARI

To compare the true labels (e.g. ground truth) with the clustering results from an algorithm (like K-means), and so assess how well the clustering algorithm has performed.

🧮 Rand Index (RI) vs. Adjusted Rand Index (ARI)

The Rand Index (RI) is the proportion of pairs of elements that are treated consistently by both clusterings (same cluster in both, or different clusters in both).

However, RI is not corrected for chance: it can be high even for random clusterings.

ARI subtracts the agreement expected by chance and rescales, so that random labelings score about 0 and identical partitions score 1.

\[ \text{ARI} = \frac{ \sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \middle/ \binom{n}{2} \right] }{ \frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \middle/ \binom{n}{2} \right] } \]

\[\begin{align*} & n_{ij} \text{ is the number of elements in both cluster } i \text{ of } U \text{ and cluster } j \text{ of } V, \\ & a_i = \sum_j n_{ij} \quad \text{(sum over row } i\text{)}, \\ & b_j = \sum_i n_{ij} \quad \text{(sum over column } j\text{)}, \\ & n = \sum_{ij} n_{ij} \quad \text{(total number of samples)}. \end{align*}\]

Interpretation:

ARI = 1 → perfect agreement

ARI ≈ 0 → random labeling

ARI < 0 → worse than random

# Install needed package
#install.packages("mclust")
library(mclust)

# True labels vs. predicted clusters
true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)

# Compute Adjusted Rand Index
adjustedRandIndex(true_labels, predicted_clusters)
[1] 1
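
The ARI formula above can also be evaluated directly from the contingency table of the two labelings. The following base R sketch is only an illustration of that formula, not the mclust implementation; `ari_manual` and the second labeling `crossed` are hypothetical names introduced here.

```r
# Adjusted Rand Index computed from the contingency table n_ij
ari_manual <- function(u, v) {
  tab <- table(u, v)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))            # sum over cells of C(n_ij, 2)
  sum_a  <- sum(choose(rowSums(tab), 2))   # sum over rows of C(a_i, 2)
  sum_b  <- sum(choose(colSums(tab), 2))   # sum over columns of C(b_j, 2)
  expected <- sum_a * sum_b / choose(n, 2) # agreement expected by chance
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)

# Same partition with different label names: perfect agreement
ari_same <- ari_manual(true_labels, predicted_clusters)

# A labeling that cuts across every true cluster: negative ARI
crossed <- c(1, 2, 1, 2, 1, 2)
ari_crossed <- ari_manual(true_labels, crossed)

c(ari_same, ari_crossed)
```

The first value illustrates that ARI ignores the label names themselves; the second shows that a labeling that systematically splits the true clusters can score below zero.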

Normalized mutual information (NMI)

Normalized Mutual Information (NMI) is another popular measure for evaluating the similarity between two clusterings. Like the Adjusted Rand Index it compares two labelings, but it is based on information theory rather than pair counting.

✅ What is NMI?

NMI measures the mutual dependence between two clusterings, that is, how much information one clustering gives about the other, and normalizes the result so that it ranges from 0 to 1:

NMI = 1: perfect match

NMI = 0: completely independent (no mutual information)

\[ \text{NMI}(U, V) = \frac{2 \cdot I(U; V)}{H(U) + H(V)} \]

\[\begin{align*} I(U; V) &= \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{n_{ij}}{n} \log \left( \frac{n_{ij} \cdot n}{n_{i\cdot} \cdot n_{\cdot j}} \right) \\ H(U) &= - \sum_{i=1}^{|U|} \frac{n_{i\cdot}}{n} \log \left( \frac{n_{i\cdot}}{n} \right) \\ H(V) &= - \sum_{j=1}^{|V|} \frac{n_{\cdot j}}{n} \log \left( \frac{n_{\cdot j}}{n} \right) \end{align*}\]

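Where \(n_{ij}\) is the number of elements in both cluster \(i\) of \(U\) and cluster \(j\) of \(V\), \(n_{i\cdot} = \sum_j n_{ij}\) and \(n_{\cdot j} = \sum_i n_{ij}\) are the row and column totals, and \(n\) is the total number of samples.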

# Install package if needed
# install.packages("aricode")
library(aricode)

# Example labels
true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)

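# Compute Normalized Mutual Information
# Note: the formula above normalizes by the average of H(U) and H(V). To our
# understanding this corresponds to variant = "sum" in aricode, whose default
# is "max"; check ?NMI. All variants give 1 for a perfect match.
NMI(true_labels, predicted_clusters)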
[1] 1
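
To make the formula concrete, here is a base R sketch that computes NMI with the arithmetic-mean normalization shown above. It is an illustration only and not the aricode implementation; `nmi_manual` and the labeling `crossed` are hypothetical names.

```r
# NMI(U, V) = 2 * I(U; V) / (H(U) + H(V)), computed from the contingency table
nmi_manual <- function(u, v) {
  tab  <- unclass(table(u, v))
  p_ij <- tab / sum(tab)          # joint proportions n_ij / n
  p_i  <- rowSums(p_ij)           # marginal proportions of U
  p_j  <- colSums(p_ij)           # marginal proportions of V
  nz   <- p_ij > 0                # skip empty cells (0 * log 0 is taken as 0)

  mi  <- sum(p_ij[nz] * log(p_ij[nz] / outer(p_i, p_j)[nz]))
  h_u <- -sum(p_i[p_i > 0] * log(p_i[p_i > 0]))
  h_v <- -sum(p_j[p_j > 0] * log(p_j[p_j > 0]))

  2 * mi / (h_u + h_v)
}

true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)

# Identical partitions up to label names
nmi_same <- nmi_manual(true_labels, predicted_clusters)

# A labeling that is statistically independent of the true labels
crossed <- c(1, 2, 1, 2, 1, 2)
nmi_indep <- nmi_manual(true_labels, crossed)

c(nmi_same, nmi_indep)
```

The two cases correspond to the endpoints of the scale: a perfect match and a completely independent labeling.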

Comparison with ARI

Measure   Based on             Range     Sensitive to         Interpretation
ARI       Pair counting        [-1, 1]   Overlap of pairs     1 = perfect match
NMI       Information theory   [0, 1]    Shared information   1 = perfect match

Gini impurity

Gini impurity is a measure originally used in decision trees (e.g., CART – Classification and Regression Trees) to quantify how “pure” a node is. While not typically used in standard unsupervised clustering (like K-means), it can be applied to evaluate the purity of clusters with respect to known class labels (i.e., in a supervised validation context).

Important: Gini impurity is not a clustering criterion itself but rather a post-hoc evaluation metric to assess how homogeneous clusters are with respect to some known labels.

The Gini impurity of a cluster \(C\) is defined as:

\[ \text{Gini}(C) = 1 - \sum_{k=1}^{K} p_k^2 \]

where \(p_k\) is the proportion of elements in cluster \(C\) that belong to class \(k\). It reaches its minimum (0) when all elements belong to a single class.

Interpretation

The Gini impurity of a cluster \(C\) lies within the range:

\[ \text{Gini}(C) \in \left[ 0, 1 - \frac{1}{K} \right] \]

  • \(\text{Gini}(C) = 0\) implies the cluster is pure: all instances belong to a single class.
  • Maximum impurity occurs when classes are uniformly distributed within the cluster. In this case, the Gini impurity equals \(1 - \frac{1}{K}\), where \(K\) is the number of classes.

Examples:

  • Cluster with 100% of class “A”: Gini = 1 - (1)^2 = 0

  • Cluster with 50% class “A” and 50% class “B”: Gini = 1 - (0.5^2 + 0.5^2) = 0.5

  • Cluster with equal thirds of classes A, B, and C: \[ \text{Gini} = 1 - 3 \times \left(\frac{1}{3}\right)^2 = 1 - \frac{1}{3} = \frac{2}{3} \approx 0.667 \]

# Example in R
set.seed(123)

# Create a simple dataset with labels
library(dplyr)
n <- 100
x <- rbind(
  matrix(rnorm(n, mean=0), ncol=2),
  matrix(rnorm(n, mean=3), ncol=2)
)
labels <- factor(c(rep("A", n/2), rep("B", n/2)))
df <- data.frame(x1 = x[,1], x2 = x[,2], label = labels)

# Perform k-means clustering
kmeans_result <- kmeans(df[, c("x1", "x2")], centers = 2)
df$cluster <- kmeans_result$cluster

# Compute Gini impurity for each cluster
gini_impurity <- function(cluster_labels) {
  p <- prop.table(table(cluster_labels))
  1 - sum(p^2)
}

gini_by_cluster <- df %>%
  group_by(cluster) %>%
  summarise(gini = gini_impurity(label))

print(gini_by_cluster)

Entropy as a Cluster Purity Measure

The entropy of a cluster \(C\) with respect to class distribution is defined as:

\[ \text{Entropy}(C) = - \sum_{k=1}^{K} p_k \log_2(p_k) \]

where \(p_k\) is the proportion of class \(k\) in cluster \(C\).

Interpretation:

  • If all points in a cluster belong to the same class: \(\text{Entropy} = 0\)
  • If classes are equally mixed: \(\text{Entropy} = \log_2(K)\)
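
Both boundary cases can be checked numerically. The helper below is a small sketch that takes class proportions directly; `entropy_from_props` is a hypothetical name used only for this illustration.

```r
# Entropy (in bits) of a vector of class proportions
entropy_from_props <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

h_pure  <- entropy_from_props(c(1, 0, 0))      # all points in one class
h_mixed <- entropy_from_props(rep(1 / 4, 4))   # four equally mixed classes

c(h_pure, h_mixed, log2(4))
```

The pure cluster has entropy 0, and the uniform four-class cluster reaches the maximum log2(4) = 2 bits.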

To compute the total entropy across all clusters:

\[ \text{Total Entropy} = \sum_{i=1}^{M} \frac{n_i}{n} \cdot \text{Entropy}(C_i) \]

where:

  • \(M\): total number of clusters
  • \(n_i\): number of items in cluster \(C_i\)
  • \(n\): total number of data points
# Assume df has 'label' and 'cluster' columns as before
library(dplyr)

# Entropy function
entropy <- function(cluster_labels) {
  p <- prop.table(table(cluster_labels))
  -sum(p * log2(p), na.rm = TRUE)
}

# Calculate entropy per cluster
entropy_by_cluster <- df %>%
  group_by(cluster) %>%
  summarise(entropy = entropy(label),
            n = n()) %>%
  mutate(weight = n / sum(n),
         weighted_entropy = entropy * weight)

# Total entropy
total_entropy <- sum(entropy_by_cluster$weighted_entropy)
print(entropy_by_cluster)
print(paste("Total Entropy:", round(total_entropy, 4)))
[1] "Total Entropy: 0.071"

CART: Classification and Regression Trees

Explanation

CART is a decision tree learning technique that can be used for both classification (categorical outcome) and regression (continuous outcome). It works by recursively splitting the data into subsets based on the feature that yields the highest purity gain.

In classification, the algorithm:

  • Starts with all the data at the root node.

  • Splits it using the variable and threshold that maximizes the decrease in impurity (e.g., Gini impurity or Entropy).

  • Repeats the process on each subset until a stopping criterion is met (e.g., minimum samples per node or max depth).

Maths:

Let \(D\) be the dataset at a node, with \(K\) classes. The Gini impurity is defined as:

\[ Gini(D) = 1 - \sum_{k=1}^K p_k^2 \]

where \(p_k\) is the proportion of observations in class \(k\) at the node.

Given a split \(s\) that partitions \(D\) into two subsets \(D_1\) and \(D_2\), the impurity of the split is:

\[ Gini_{split}(s) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2) \]

The best split minimizes this impurity.
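
The split criterion can be evaluated by hand on a small case. The sketch below uses base R and the built-in iris data with a setosa-vs-other outcome; the helpers `gini` and `gini_split` and the candidate thresholds are illustrative choices, not rpart internals.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Weighted impurity of a binary split; `left` is a logical vector marking D_1
gini_split <- function(y, left) {
  w <- mean(left)                       # |D_1| / |D|
  w * gini(y[left]) + (1 - w) * gini(y[!left])
}

y <- ifelse(iris$Species == "setosa", "setosa", "other")

g_root  <- gini(y)                                  # impurity before splitting
g_petal <- gini_split(y, iris$Petal.Length < 2.45)  # separates setosa perfectly
g_sepal <- gini_split(y, iris$Sepal.Length < 5.5)   # leaves both children mixed

c(root = g_root, petal = g_petal, sepal = g_sepal)
```

The root impurity is 1 - (2/3)^2 - (1/3)^2 = 4/9. The Petal.Length threshold drives the split impurity to 0, so it is preferred over the Sepal.Length threshold, which leaves mixed children.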

Example in R

# Load required library
library(rpart)
library(rpart.plot)

# Load a sample dataset
data(iris)

# Convert Species to binary classification (e.g., setosa vs. others)
iris$BinarySpecies <- ifelse(iris$Species == "setosa", "setosa", "other")
iris$BinarySpecies <- as.factor(iris$BinarySpecies)

# Build a CART classification model
cart_model <- rpart(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris, method = "class")

# Print model summary
print(cart_model)
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 50 other (0.6666667 0.3333333)  
  2) Petal.Length>=2.45 100  0 other (1.0000000 0.0000000) *
  3) Petal.Length< 2.45 50  0 setosa (0.0000000 1.0000000) *
# Plot the tree
rpart.plot(cart_model, type = 4, extra = 104, fallen.leaves = TRUE)

Notes:

  • method = "class" requests a classification tree.
  • For classification, rpart splits on Gini impurity by default; passing parms = list(split = "information") switches to entropy.
  • The model chooses splits that minimize the weighted sum of child node impurities.

CART Classification: Entropy

Given a dataset \(D\) with \(K\) classes, the entropy is defined as:

\[ Entropy(D) = - \sum_{k=1}^K p_k \log_2 p_k \]

where \(p_k\) is the proportion of class \(k\) in \(D\).

For a split \(s\) that divides \(D\) into \(D_1\) and \(D_2\), the entropy after the split is:

\[ Entropy_{split}(s) = \frac{|D_1|}{|D|} Entropy(D_1) + \frac{|D_2|}{|D|} Entropy(D_2) \]

The information gain is:

\[ Gain(s) = Entropy(D) - Entropy_{split}(s) \]

The best split maximizes \(Gain(s)\).
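
As a small illustration, the gain of a candidate split can be computed by hand, and rpart itself can be asked to use the entropy criterion through parms = list(split = "information"). The helpers `node_entropy` and `info_gain` and the copy `iris_bin` are hypothetical names introduced for this sketch.

```r
library(rpart)

# Work on a copy so the original iris data frame is left untouched
iris_bin <- iris
iris_bin$BinarySpecies <- factor(ifelse(iris_bin$Species == "setosa", "setosa", "other"))

# Entropy (in bits) of a vector of class labels
node_entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain of a binary split; `left` is a logical vector marking D_1
info_gain <- function(y, left) {
  w <- mean(left)
  node_entropy(y) - (w * node_entropy(y[left]) + (1 - w) * node_entropy(y[!left]))
}

# Gain of the split Petal.Length < 2.45 (both children are pure)
gain_petal <- info_gain(iris_bin$BinarySpecies, iris_bin$Petal.Length < 2.45)

# CART tree grown with the entropy ("information") splitting criterion
fit_info <- rpart(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = iris_bin, method = "class",
                  parms = list(split = "information"))

# Training accuracy of the entropy-based tree
acc_info <- mean(predict(fit_info, iris_bin, type = "class") == iris_bin$BinarySpecies)

c(gain = gain_petal, accuracy = acc_info)
```

Because the split leaves two pure children, its gain equals the root entropy (about 0.918 bits), and the fitted tree classifies the training data perfectly.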

# Install package if needed (RWeka requires a Java runtime)
# install.packages("RWeka")
library(RWeka)

# Convert Species to binary classification
data(iris)
iris$BinarySpecies <- ifelse(iris$Species == "setosa", "setosa", "other")
iris$BinarySpecies <- as.factor(iris$BinarySpecies)

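# Build an entropy-based decision tree using J48 (Weka's C4.5 implementation, not CART).
# Note: C4.5 ranks splits by gain ratio, a normalized form of information gain.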
model_entropy <- J48(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

# Summarize performance on the training data
summary(model_entropy)

=== Summary ===

Correctly Classified Instances         150              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances              150     

=== Confusion Matrix ===

   a   b   <-- classified as
 100   0 |   a = other
   0  50 |   b = setosa

HDBSCAN: Hierarchical Density-Based Clustering

Overview

HDBSCAN is an advanced clustering method that:

  • Extends DBSCAN by converting it into a hierarchical clustering algorithm.

  • Extracts the most stable clusters from the hierarchy.

  • Automatically determines the number of clusters and handles noise/outliers effectively.

Its main tuning parameter is minPts. Unlike DBSCAN, it needs no fixed neighborhood radius (eps), so it works well for clusters of varying density.

HDBSCAN Clustering

Let \(x, y \in \mathbb{R}^d\) be data points. Define:

1. Core distance:

\[ \text{core}_{k}(x) = \text{distance to the } k\text{-th nearest neighbor of } x \]

2. Mutual Reachability Distance (MRD):

\[ \text{MRD}(x, y) = \max \left\{ \text{core}_k(x), \text{core}_k(y), \|x - y\| \right\} \]

3. Cluster hierarchy:

Build a minimum spanning tree (MST) from all pairwise MRDs. Perform single-linkage clustering to obtain a hierarchy.

4. Stability:

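Let \(C\) be a cluster and let \(\lambda = 1/\varepsilon\), where \(\varepsilon\) is the mutual reachability distance threshold at a given level of the hierarchy. Write \(\lambda_{\text{birth}}(C)\) for the value at which \(C\) first appears and \(\lambda_{\text{death}}(x)\) for the value at which point \(x\) leaves \(C\) (or \(C\) splits). The cluster stability is:

\[ \text{Stability}(C) = \sum_{x \in C} \left( \lambda_{\text{death}}(x) - \lambda_{\text{birth}}(C) \right) \]

Because \(\lambda\) grows as the threshold shrinks, each term is non-negative. The final clustering keeps the non-overlapping clusters with the greatest total stability.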
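
Steps 1 to 3 can be sketched in base R on a tiny dataset. This is an illustration of the definitions only, not the algorithm used inside the dbscan package: here k counts neighbors excluding the point itself (an assumption of this sketch), single linkage via hclust stands in for the MST construction, and the stability-based cluster extraction of step 4 is replaced by a plain cut into two groups. All names (`pts`, `core`, `mrd`, `hc`, `clusters`) are hypothetical.

```r
set.seed(1)

# Two well-separated groups of 10 points each
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))

k <- 3                                   # neighborhood size for the core distance
d <- as.matrix(dist(pts))                # pairwise Euclidean distances

# 1. Core distance: distance to the k-th nearest neighbor
#    (position 1 of the sorted row is the point itself, at distance 0)
core <- apply(d, 1, function(row) sort(row)[k + 1])

# 2. Mutual reachability distance: max(core(x), core(y), ||x - y||)
mrd <- pmax(d, outer(core, core, pmax))
diag(mrd) <- 0

# 3. Single-linkage hierarchy on the MRD matrix
hc <- hclust(as.dist(mrd), method = "single")

# Simplified extraction: cut the hierarchy into two clusters
clusters <- cutree(hc, k = 2)
table(clusters)
```

Taking the maximum with the core distances pushes points in sparse regions away from everything else, which is what makes the resulting hierarchy robust to noise compared with single linkage on the raw distances.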

Example

# Install required package if not already
# install.packages("dbscan")

library(dbscan)
library(ggplot2)

# Simulated 2D data
set.seed(42)
n <- 100
x <- cbind(
  x = c(rnorm(n, 0, 0.3), rnorm(n, 3, 0.3)),
  y = c(rnorm(n, 0, 0.3), rnorm(n, 3, 0.3))
)

# HDBSCAN clustering
hdb <- hdbscan(x, minPts = 10)

# Print results
print(hdb)
HDBSCAN clustering for 200 objects.
Parameters: minPts = 10
The clustering contains 2 cluster(s) and 0 noise points.

  1   2 
100 100 

Available fields: cluster, minPts, coredist, cluster_scores, membership_prob, outlier_scores, hc
# Plot clusters
df <- data.frame(x, cluster = factor(hdb$cluster))
ggplot(df, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  ggtitle("HDBSCAN Clustering") +
  theme_minimal()

When to Use HDBSCAN

  • When clusters have different densities.

  • When number of clusters is unknown.

  • When noise/outliers should be automatically detected.

