Cramer V
Data
# Sample data: counts of preferences by gender
data <- matrix(c(
20, 30, 25, # Male: electronics, clothing, food
15, 40, 20 # Female: electronics, clothing, food
), nrow = 2, byrow = TRUE)
# Add row and column names
rownames(data) <- c("Male", "Female")
colnames(data) <- c("Electronics", "Clothing", "Food")
# Show the contingency table
print(data)
Electronics Clothing Food
Male 20 30 25
Female 15 40 20
Chi square
# Perform chi-square test
chisq_result <- chisq.test(data)
# View results
chisq_result
Pearson's Chi-squared test
data: data
X-squared = 2.6984, df = 2, p-value = 0.2594
Cramer’s V
# Install package if needed
#install.packages("lsr") # only once
library(lsr)
# Calculate Cramér's V
cramersV(data)
[1] 0.1341246
Interpretation
Chi-square test will give you a p-value to say if the association is
statistically significant.
Cramér’s V will tell you the strength:
~0.1 → weak
~0.3 → moderate
~0.5+ → strong
Example on Cohen’s D
Cohen’s d is a measure of effect size used to indicate the
standardized difference between two means. It tells you how large the
difference is, in terms of standard deviations.
✅ Purpose of Cohen’s d Quantifies the magnitude of the difference
between two groups (e.g. treatment vs. control).
Often used in t-tests, psychology, education, and medical
research.
\[
d = \frac{M_1 - M_2}{SD_{\text{pooled}}}
\quad \text{where} \quad
SD_{\text{pooled}} = \sqrt{\frac{SD_1^2 + SD_2^2}{2}}
\]
# Sample data
group1 <- c(100, 102, 98, 95, 101) # e.g. control
group2 <- c(110, 108, 112, 107, 111) # e.g. treatment
# Install effectsize package (if needed)
#install.packages("effectsize")
library(effectsize)
# Compute Cohen's d
cohens_d(group2, group1)
Cohen's d | 95% CI
------------------------
4.25 | [1.83, 6.60]
- Estimated using pooled SD.
Interpretation (per Cohen’s conventions) Cohen’s d Effect Size 0.0 –
0.2 Small 0.2 – 0.5 Medium 0.5 – 0.8 Large > 0.8 Very large
These are general guidelines—context matters.
Adjusted Rand Index
✅ Purpose of ARI To compare the true labels (e.g. ground truth) with
the clustering results from an algorithm (like K-means).
To assess how well a clustering algorithm has performed.
🧮 Rand Index (RI) vs. Adjusted Rand Index (ARI) The Rand Index (RI)
counts the proportion of pairs of elements that are assigned
consistently in both clusterings (same cluster in both, or different
clusters in both).
However, RI can be biased — high even for random clusterings.
So, ARI adjusts for chance.
\[
\text{ARI} = \frac{
\sum_{ij} \binom{n_{ij}}{2}
- \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \middle/
\binom{n}{2} \right]
}{
\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right]
- \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \middle/
\binom{n}{2} \right]
}
\]
\[\begin{align*}
& n_{ij} \text{ is the number of elements in both cluster } i \text{
of } U \text{ and cluster } j \text{ of } V, \\
& a_i = \sum_j n_{ij} \quad \text{(sum over row } i\text{)}, \\
& b_j = \sum_i n_{ij} \quad \text{(sum over column } j\text{)}, \\
& n = \sum_{ij} n_{ij} \quad \text{(total number of samples)}.
\end{align*}\]
Where:
ARI = 1 → perfect agreement
ARI ≈ 0 → random labeling
ARI < 0 → worse than random
# Install needed package
#install.packages("mclust")
library(mclust)
# True labels vs. predicted clusters
true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)
# Compute Adjusted Rand Index
adjustedRandIndex(true_labels, predicted_clusters)
[1] 1
NMI
1 NMI=1: perfect match
NMI
0 NMI=0: completely independent (no mutual information)
\[
\text{NMI}(U, V) = \frac{2 \cdot I(U; V)}{H(U) + H(V)}
\]
\[\begin{align*}
I(U; V) &= \sum_{i=1}^{|U|} \sum_{j=1}^{|V|}
\frac{n_{ij}}{n} \log \left( \frac{n_{ij} \cdot n}{n_{i\cdot} \cdot
n_{\cdot j}} \right) \\
H(U) &= - \sum_{i=1}^{|U|} \frac{n_{i\cdot}}{n} \log \left(
\frac{n_{i\cdot}}{n} \right) \\
H(V) &= - \sum_{j=1}^{|V|} \frac{n_{\cdot j}}{n} \log \left(
\frac{n_{\cdot j}}{n} \right)
\end{align*}\]
Where:
- \(U, V\) are the two
clusterings.
- \(n_{ij}\): number of samples in
cluster \(i\) of \(U\) and cluster \(j\) of \(V\).
- \(n_{i\cdot}, n_{\cdot j}\): row
and column sums of the contingency table.
- \(n\): total number of
samples.
- \(I(U; V)\): mutual information
between \(U\) and \(V\).
- \(H(U), H(V)\): entropies of the
clusterings \(U\) and \(V\), respectively.
# Install package if needed
# install.packages("aricode")
library(aricode)
# Example labels
true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)
# Compute Normalized Mutual Information
NMI(true_labels, predicted_clusters)
[1] 1
Comparison with ARI Measure Based on Range Sensitive to
Interpretation ARI Pair counting [-1, 1] Overlap of pairs 1 = perfect
match NMI Information theory [0, 1] Shared information 1 = perfect
match
Gini impurity
Gini impurity is a measure originally used in decision trees (e.g.,
CART – Classification and Regression Trees) to quantify how “pure” a
node is. While not typically used in standard unsupervised clustering
(like K-means), it can be applied to evaluate the purity of clusters
with respect to known class labels (i.e., in a supervised validation
context).
Important: Gini impurity is not a clustering criterion itself but
rather a post-hoc evaluation metric to assess how homogeneous clusters
are with respect to some known labels.
The Gini impurity of a cluster \(C\)
is defined as:
\[
\text{Gini}(C) = 1 - \sum_{k=1}^{K} p_k^2
\]
where \(p_k\) is the proportion of
elements in cluster \(C\) that belong
to class \(k\). It reaches its minimum
(0) when all elements belong to a single class.
interpretation
The Gini impurity of a cluster \(C\)
lies within the range:
\[
\text{Gini}(C) \in \left[ 0, 1 - \frac{1}{K} \right]
\]
- implies the cluster is — all instances belong to a single
class.
- impurity occurs when classes are uniformly distributed within the
cluster. In this case, the Gini impurity approaches \(1 - \frac{1}{K}\), where \(K\) is the number of classes.
Examples:
Cluster with 100% of class “A”: Gini = 1 - (1)^2 = 0
Cluster with 50% class “A” and 50% class “B”: Gini = 1 - (0.5^2 +
0.5^2) = 0.5
Cluster with equal thirds of classes A, B, and C: \[
\text{Gini} = 1 - 3 \times \left(\frac{1}{3}\right)^2 = 1 - \frac{1}{3}
= \frac{2}{3} \approx 0.666
\]
# Example in R
set.seed(123)
# Create a simple dataset with labels
library(dplyr)
n <- 100
x <- rbind(
matrix(rnorm(n, mean=0), ncol=2),
matrix(rnorm(n, mean=3), ncol=2)
)
labels <- factor(c(rep("A", n/2), rep("B", n/2)))
df <- data.frame(x1 = x[,1], x2 = x[,2], label = labels)
# Perform k-means clustering
kmeans_result <- kmeans(df[, c("x1", "x2")], centers = 2)
df$cluster <- kmeans_result$cluster
# Compute Gini impurity for each cluster
gini_impurity <- function(cluster_labels) {
p <- prop.table(table(cluster_labels))
1 - sum(p^2)
}
gini_by_cluster <- df %>%
group_by(cluster) %>%
summarise(gini = gini_impurity(label))
print(gini_by_cluster)
NA
Entropy as a Cluster Purity Measure
The entropy of a cluster \(C\) with
respect to class distribution is defined as:
\[
\text{Entropy}(C) = - \sum_{k=1}^{K} p_k \log_2(p_k)
\]
where \(p_k\) is the proportion of
class \(k\) in cluster \(C\).
Interpretation:
- If all points in a cluster belong to the same class: \(\text{Entropy} = 0\)
- If classes are equally mixed: \(\text{Entropy} = \log_2(K)\)
To compute the total entropy across all clusters:
\[
\text{Total Entropy} = \sum_{i=1}^{M} \frac{n_i}{n} \cdot
\text{Entropy}(C_i)
\]
where:
- \(M\): total number of
clusters
- \(n_i\): number of items in cluster
\(C_i\)
- \(n\): total number of data
points
# Assume df has 'label' and 'cluster' columns as before
library(dplyr)
# Entropy function
entropy <- function(cluster_labels) {
p <- prop.table(table(cluster_labels))
-sum(p * log2(p), na.rm = TRUE)
}
# Calculate entropy per cluster
entropy_by_cluster <- df %>%
group_by(cluster) %>%
summarise(entropy = entropy(label),
n = n()) %>%
mutate(weight = n / sum(n),
weighted_entropy = entropy * weight)
# Total entropy
total_entropy <- sum(entropy_by_cluster$weighted_entropy)
print(entropy_by_cluster)
print(paste("Total Entropy:", round(total_entropy, 4)))
[1] "Total Entropy: 0.071"
CART: Classification and Regression Trees
Explanation
CART is a decision tree learning technique that can be used for both
classification (categorical outcome) and regression (continuous
outcome). It works by recursively splitting the data into subsets based
on the feature that yields the highest purity gain.
In classification, the algorithm:
Starts with all the data at the root node.
Splits it using the variable and threshold that maximizes the
decrease in impurity (e.g., Gini impurity or Entropy).
Repeats the process on each subset until a stopping criterion is
met (e.g., minimum samples per node or max depth).
Maths:
Let \(D\) be the dataset at a node,
with \(K\) classes. The Gini impurity
is defined as:
\[
Gini(D) = 1 - \sum_{k=1}^K p_k^2
\]
where \(p_k\) is the proportion of
observations in class \(k\) at the
node.
Given a split \(s\) that partitions
\(D\) into two subsets \(D_1\) and \(D_2\), the impurity of the split is:
\[
Gini_{split}(s) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|}
Gini(D_2)
\]
The best split minimizes this impurity.
Example in R
# Load required library
library(rpart)
library(rpart.plot)
# Load a sample dataset
data(iris)
# Convert Species to binary classification (e.g., setosa vs. others)
iris$BinarySpecies <- ifelse(iris$Species == "setosa", "setosa", "other")
iris$BinarySpecies <- as.factor(iris$BinarySpecies)
# Build a CART classification model
cart_model <- rpart(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris, method = "class")
# Print model summary
print(cart_model)
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 50 other (0.6666667 0.3333333)
2) Petal.Length>=2.45 100 0 other (1.0000000 0.0000000) *
3) Petal.Length< 2.45 50 0 setosa (0.0000000 1.0000000) *
# Plot the tree
rpart.plot(cart_model, type = 4, extra = 104, fallen.leaves = TRUE)

Notes:
- method = “class” ensures it performs classification.
- The impurity metric used by default is Gini impurity.
- The model chooses splits that minimize the weighted sum of child
node impurities.
CART Classification: Entropy
Given a dataset \(D\) with \(K\) classes, the entropy is defined as:
\[
Entropy(D) = - \sum_{k=1}^K p_k \log_2 p_k
\]
where \(p_k\) is the proportion of
class \(k\) in \(D\).
For a split \(s\) that divides \(D\) into \(D_1\) and \(D_2\), the entropy after the split is:
\[
Entropy_{split}(s) = \frac{|D_1|}{|D|} Entropy(D_1) + \frac{|D_2|}{|D|}
Entropy(D_2)
\]
The information gain is:
\[
Gain(s) = Entropy(D) - Entropy_{split}(s)
\]
The best split maximizes \(Gain(s)\).
library(RWeka)
# Convert Species to binary classification
data(iris)
iris$BinarySpecies <- ifelse(iris$Species == "setosa", "setosa", "other")
iris$BinarySpecies <- as.factor(iris$BinarySpecies)
# Build an entropy-based decision tree using J48 (C4.5)
model_entropy <- J48(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
# Print the model
summary(model_entropy)
=== Summary ===
Correctly Classified Instances 150 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 150
=== Confusion Matrix ===
a b <-- classified as
100 0 | a = other
0 50 | b = setosa
HDBSCAN: Hierarchical Density-Based Clustering
Overview
HDBSCAN is an advanced clustering method that:
Extends DBSCAN by converting it into a hierarchical clustering
algorithm.
Extracts the most stable clusters from the hierarchy.
Automatically determines the number of clusters and handles
noise/outliers effectively.
It is non-parametric (except for minPts) and works well for clusters
of varying density, unlike DBSCAN.
HDBSCAN Clustering
Let \(x, y \in \mathbb{R}^d\) be
data points. Define:
1. Core distance:
\[
\text{core}_{k}(x) = \text{distance to the } k\text{-th nearest neighbor
of } x
\]
** 2. Mutual Reachability Distance (MRD):
\[
\text{MRD}(x, y) = \max \left\{ \text{core}_k(x), \text{core}_k(y), \|x
- y\| \right\}
\]
3. Cluster hierarchy:
Build a minimum spanning tree (MST) from all pairwise MRDs. Perform
single-linkage clustering to obtain a hierarchy.
4. Stability:
Let \(C\) be a cluster, and \(\lambda = \frac{1}{\text{MRD}}\). Define
the cluster stability as:
\[
\text{Stability}(C) = \sum_{x \in C} (\lambda_{\text{birth}}(x) -
\lambda_{\text{death}}(x))
\]
Only the most stable clusters are retained.
Example
# Install required package if not already
# install.packages("dbscan")
library(dbscan)
library(ggplot2)
# Simulated 2D data
set.seed(42)
n <- 100
x <- cbind(
x = c(rnorm(n, 0, 0.3), rnorm(n, 3, 0.3)),
y = c(rnorm(n, 0, 0.3), rnorm(n, 3, 0.3))
)
# HDBSCAN clustering
hdb <- hdbscan(x, minPts = 10)
# Print results
print(hdb)
HDBSCAN clustering for 200 objects.
Parameters: minPts = 10
The clustering contains 2 cluster(s) and 0 noise points.
1 2
100 100
Available fields: cluster, minPts, coredist, cluster_scores, membership_prob, outlier_scores, hc
# Plot clusters
df <- data.frame(x, cluster = factor(hdb$cluster))
ggplot(df, aes(x = x, y = y, color = cluster)) +
geom_point(size = 3) +
ggtitle("HDBSCAN Clustering") +
theme_minimal()

When to Use HDBSCAN
When clusters have different densities.
When number of clusters is unknown.
When noise/outliers should be automatically detected.
---
title: "AI methods"
output: html_notebook
---

## Cramer V



- Create a contingency table

- Perform a Chi-square test

- Calculate Cramér’s V

### Data

```{r}
# Sample data: counts of preferences by gender
data <- matrix(c(
  20, 30, 25,  # Male: electronics, clothing, food
  15, 40, 20   # Female: electronics, clothing, food
), nrow = 2, byrow = TRUE)

# Add row and column names
rownames(data) <- c("Male", "Female")
colnames(data) <- c("Electronics", "Clothing", "Food")

# Show the contingency table
print(data)

```

### Chi square

```{r}
# Perform chi-square test
chisq_result <- chisq.test(data)

# View results
chisq_result

```

## Cramer's V

```{r}
# Install package if needed
#install.packages("lsr")  # only once
library(lsr)

# Calculate Cramér's V
cramersV(data)

```


### Interpretation

Chi-square test will give you a p-value to say if the association is statistically significant.

Cramér’s V will tell you the strength:

~0.1 → weak

~0.3 → moderate

~0.5+ → strong



## Example on Cohen's D

Cohen's d is a measure of effect size used to indicate the standardized difference between two means. It tells you how large the difference is, in terms of standard deviations.

✅ Purpose of Cohen's d
Quantifies the magnitude of the difference between two groups (e.g. treatment vs. control).

Often used in t-tests, psychology, education, and medical research.

\[
d = \frac{M_1 - M_2}{SD_{\text{pooled}}}
\quad \text{where} \quad
SD_{\text{pooled}} = \sqrt{\frac{SD_1^2 + SD_2^2}{2}}
\]


```{r}
# Sample data
group1 <- c(100, 102, 98, 95, 101)   # e.g. control
group2 <- c(110, 108, 112, 107, 111) # e.g. treatment

# Install effectsize package (if needed)
#install.packages("effectsize") 
library(effectsize)

# Compute Cohen's d
cohens_d(group2, group1)

```

Interpretation (per Cohen's conventions)
Cohen’s d	Effect Size
0.0 – 0.2	Small
0.2 – 0.5	Medium
0.5 – 0.8	Large
> 0.8	Very large

These are general guidelines—context matters.


## Adjusted Rand Index

✅ Purpose of ARI
To compare the true labels (e.g. ground truth) with the clustering results from an algorithm (like K-means).

To assess how well a clustering algorithm has performed.

🧮 Rand Index (RI) vs. Adjusted Rand Index (ARI)
The Rand Index (RI) counts the proportion of pairs of elements that are assigned consistently in both clusterings (same cluster in both, or different clusters in both).

However, RI can be biased — high even for random clusterings.

So, ARI adjusts for chance.

\[
\text{ARI} = \frac{
\sum_{ij} \binom{n_{ij}}{2}
- \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \middle/ \binom{n}{2} \right]
}{
\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right]
- \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \middle/ \binom{n}{2} \right]
}
\]

\text{where:}
\begin{align*}
& n_{ij} \text{ is the number of elements in both cluster } i \text{ of } U \text{ and cluster } j \text{ of } V, \\
& a_i = \sum_j n_{ij} \quad \text{(sum over row } i\text{)}, \\
& b_j = \sum_i n_{ij} \quad \text{(sum over column } j\text{)}, \\
& n = \sum_{ij} n_{ij} \quad \text{(total number of samples)}.
\end{align*}


Where:

ARI = 1 → perfect agreement

ARI ≈ 0 → random labeling

ARI < 0 → worse than random

```{r}
# Install needed package
#install.packages("mclust")
library(mclust)

# True labels vs. predicted clusters
true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)

# Compute Adjusted Rand Index
adjustedRandIndex(true_labels, predicted_clusters)

```

## Normalized mutual information (NMI)

Normalized Mutual Information (NMI) is another popular measure for evaluating the similarity between two clusterings, like the Adjusted Rand Index — but it's based on information theory rather than pair counting.

✅ What is NMI?
NMI measures the mutual dependence between two clusterings — how much information one clustering gives about the other — and normalizes the result so that it ranges from 0 to 1:

NMI
=
1
NMI=1: perfect match

NMI
=
0
NMI=0: completely independent (no mutual information)

\[
\text{NMI}(U, V) = \frac{2 \cdot I(U; V)}{H(U) + H(V)}
\]

\text{where:}

\begin{align*}
I(U; V) &= \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} 
\frac{n_{ij}}{n} \log \left( \frac{n_{ij} \cdot n}{n_{i\cdot} \cdot n_{\cdot j}} \right) \\
H(U) &= - \sum_{i=1}^{|U|} \frac{n_{i\cdot}}{n} \log \left( \frac{n_{i\cdot}}{n} \right) \\
H(V) &= - \sum_{j=1}^{|V|} \frac{n_{\cdot j}}{n} \log \left( \frac{n_{\cdot j}}{n} \right)
\end{align*}


Where:

  - $U, V$ are the two clusterings.
  - $n_{ij}$: number of samples in cluster $i$ of $U$ and cluster $j$ of $V$.
  - $n_{i\cdot}, n_{\cdot j}$: row and column sums of the contingency table.
  - $n$: total number of samples.
  - $I(U; V)$: mutual information between $U$ and $V$.
  - $H(U), H(V)$: entropies of the clusterings $U$ and $V$, respectively.



```{r}
# Install package if needed
# install.packages("aricode")
library(aricode)

# Example labels
true_labels <- c(1, 1, 0, 0, 2, 2)
predicted_clusters <- c(1, 1, 2, 2, 3, 3)

# Compute Normalized Mutual Information
NMI(true_labels, predicted_clusters)

```

Comparison with ARI
Measure	Based on	Range	Sensitive to	Interpretation
ARI	Pair counting	[-1, 1]	Overlap of pairs	1 = perfect match
NMI	Information theory	[0, 1]	Shared information	1 = perfect match


## Gini impurity

Gini impurity is a measure originally used in decision trees (e.g., CART – Classification and Regression Trees) to quantify how "pure" a node is. While not typically used in standard unsupervised clustering (like K-means), it can be applied to evaluate the purity of clusters with respect to known class labels (i.e., in a supervised validation context).

Important:
Gini impurity is not a clustering criterion itself but rather a post-hoc evaluation metric to assess how homogeneous clusters are with respect to some known labels.

The Gini impurity of a cluster \( C \) is defined as:

\[
\text{Gini}(C) = 1 - \sum_{k=1}^{K} p_k^2
\]

where \( p_k \) is the proportion of elements in cluster \( C \) that belong to class \( k \). It reaches its minimum (0) when all elements belong to a single class.

### interpretation

The Gini impurity of a cluster \( C \) lies within the range:

\[
\text{Gini}(C) \in \left[ 0, 1 - \frac{1}{K} \right]
\]

-  \textbf{0} \quad implies the cluster is \emph{pure} — all instances belong to a single class.
- \textbf{Maximum} impurity occurs when classes are uniformly distributed within the cluster. In this case, the Gini impurity approaches \( 1 - \frac{1}{K} \), where \( K \) is the number of classes.


**Examples:**


- Cluster with 100\% of class "A": 
Gini = 1 - (1)^2 = 0

- Cluster with 50\% class "A" and 50\% class "B": 
Gini = 1 - \left(0.5^2 + 0.5^2\right) = 0.5


- Cluster with equal thirds of classes A, B, and C:
  \[
  \text{Gini} = 1 - 3 \times \left(\frac{1}{3}\right)^2 = 1 - \frac{1}{3} = \frac{2}{3} \approx 0.666
  \]



```{r}
# Example in R
set.seed(123)

# Create a simple dataset with labels
library(dplyr)
n <- 100
x <- rbind(
  matrix(rnorm(n, mean=0), ncol=2),
  matrix(rnorm(n, mean=3), ncol=2)
)
labels <- factor(c(rep("A", n/2), rep("B", n/2)))
df <- data.frame(x1 = x[,1], x2 = x[,2], label = labels)

# Perform k-means clustering
kmeans_result <- kmeans(df[, c("x1", "x2")], centers = 2)
df$cluster <- kmeans_result$cluster

# Compute Gini impurity for each cluster
gini_impurity <- function(cluster_labels) {
  p <- prop.table(table(cluster_labels))
  1 - sum(p^2)
}

gini_by_cluster <- df %>%
  group_by(cluster) %>%
  summarise(gini = gini_impurity(label))

print(gini_by_cluster)

```


## Entropy as a Cluster Purity Measure

The entropy of a cluster \( C \) with respect to class distribution is defined as:

\[
\text{Entropy}(C) = - \sum_{k=1}^{K} p_k \log_2(p_k)
\]

where \( p_k \) is the proportion of class \( k \) in cluster \( C \).

**Interpretation:**

- If all points in a cluster belong to the same class: \( \text{Entropy} = 0 \)
- If classes are equally mixed: \( \text{Entropy} = \log_2(K) \)


To compute the total entropy across all clusters:

\[
\text{Total Entropy} = \sum_{i=1}^{M} \frac{n_i}{n} \cdot \text{Entropy}(C_i)
\]

where:

- \( M \): total number of clusters
- \( n_i \): number of items in cluster \( C_i \)
- \( n \): total number of data points

```{r}
# Assume df has 'label' and 'cluster' columns as before
library(dplyr)

# Entropy function
entropy <- function(cluster_labels) {
  p <- prop.table(table(cluster_labels))
  -sum(p * log2(p), na.rm = TRUE)
}

# Calculate entropy per cluster
entropy_by_cluster <- df %>%
  group_by(cluster) %>%
  summarise(entropy = entropy(label),
            n = n()) %>%
  mutate(weight = n / sum(n),
         weighted_entropy = entropy * weight)

# Total entropy
total_entropy <- sum(entropy_by_cluster$weighted_entropy)
print(entropy_by_cluster)
print(paste("Total Entropy:", round(total_entropy, 4)))

```
## CART: Classification and Regression Trees


### Explanation
CART is a decision tree learning technique that can be used for both classification (categorical outcome) and regression (continuous outcome). It works by recursively splitting the data into subsets based on the feature that yields the highest purity gain.

In classification, the algorithm:

- Starts with all the data at the root node.

- Splits it using the variable and threshold that maximizes the decrease in impurity (e.g., Gini impurity or Entropy).

- Repeats the process on each subset until a stopping criterion is met (e.g., minimum samples per node or max depth).

## Maths:

Let $D$ be the dataset at a node, with $K$ classes. The Gini impurity is defined as:

\[
Gini(D) = 1 - \sum_{k=1}^K p_k^2
\]

where $p_k$ is the proportion of observations in class $k$ at the node.

Given a split $s$ that partitions $D$ into two subsets $D_1$ and $D_2$, the impurity of the split is:

\[
Gini_{split}(s) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
\]

The best split minimizes this impurity.

### Example in R

```{r}
# Load required library
library(rpart)
library(rpart.plot)

# Load a sample dataset
data(iris)

# Convert Species to binary classification (e.g., setosa vs. others)
iris$BinarySpecies <- ifelse(iris$Species == "setosa", "setosa", "other")
iris$BinarySpecies <- as.factor(iris$BinarySpecies)

# Build a CART classification model
cart_model <- rpart(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris, method = "class")

# Print model summary
print(cart_model)

# Plot the tree
rpart.plot(cart_model, type = 4, extra = 104, fallen.leaves = TRUE)

```
### Notes:

- method = "class" ensures it performs classification.
- The impurity metric used by default is Gini impurity.
- The model chooses splits that minimize the weighted sum of child node impurities.



### CART Classification: Entropy

Given a dataset $D$ with $K$ classes, the entropy is defined as:

\[
Entropy(D) = - \sum_{k=1}^K p_k \log_2 p_k
\]

where $p_k$ is the proportion of class $k$ in $D$.

For a split $s$ that divides $D$ into $D_1$ and $D_2$, the entropy after the split is:

\[
Entropy_{split}(s) = \frac{|D_1|}{|D|} Entropy(D_1) + \frac{|D_2|}{|D|} Entropy(D_2)
\]

The information gain is:

\[
Gain(s) = Entropy(D) - Entropy_{split}(s)
\]

The best split maximizes $Gain(s)$.

```{r}
library(RWeka)

# Convert Species to binary classification
data(iris)
iris$BinarySpecies <- ifelse(iris$Species == "setosa", "setosa", "other")
iris$BinarySpecies <- as.factor(iris$BinarySpecies)

# Build an entropy-based decision tree using J48 (C4.5)
model_entropy <- J48(BinarySpecies ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

# Print the model
summary(model_entropy)
```


## HDBSCAN: Hierarchical Density-Based Clustering

Overview

- HDBSCAN is an advanced clustering method that:

- Extends DBSCAN by converting it into a hierarchical clustering algorithm.

- Extracts the most stable clusters from the hierarchy.

- Automatically determines the number of clusters and handles noise/outliers effectively.

It is non-parametric (except for minPts) and works well for clusters of varying density, unlike DBSCAN.

**HDBSCAN Clustering**

Let $x, y \in \mathbb{R}^d$ be data points. Define:

**1. Core distance:**

\[
\text{core}_{k}(x) = \text{distance to the } k\text{-th nearest neighbor of } x
\]

** 2. Mutual Reachability Distance (MRD):

\[
\text{MRD}(x, y) = \max \left\{ \text{core}_k(x), \text{core}_k(y), \|x - y\| \right\}
\]

**3. Cluster hierarchy:**

Build a minimum spanning tree (MST) from all pairwise MRDs. Perform single-linkage clustering to obtain a hierarchy.

**4. Stability:**

Let $C$ be a cluster, and $\lambda = \frac{1}{\text{MRD}}$. Define the cluster stability as:

\[
\text{Stability}(C) = \sum_{x \in C} (\lambda_{\text{birth}}(x) - \lambda_{\text{death}}(x))
\]

Only the most stable clusters are retained.

### Example

```{r}
# Install required package if not already
# install.packages("dbscan")

library(dbscan)
library(ggplot2)

# Simulated 2D data
set.seed(42)
n <- 100
x <- cbind(
  x = c(rnorm(n, 0, 0.3), rnorm(n, 3, 0.3)),
  y = c(rnorm(n, 0, 0.3), rnorm(n, 3, 0.3))
)

# HDBSCAN clustering
hdb <- hdbscan(x, minPts = 10)

# Print results
print(hdb)

# Plot clusters
df <- data.frame(x, cluster = factor(hdb$cluster))
ggplot(df, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  ggtitle("HDBSCAN Clustering") +
  theme_minimal()

```




### When to Use HDBSCAN

- When clusters have different densities.

- When number of clusters is unknown.

- When noise/outliers should be automatically detected.





