This project aims to apply various clustering techniques to the Global Crocodile Species Dataset. Clustering is an unsupervised learning method used to discover hidden structures in data. In biological contexts, clustering helps identify distinct growth stages or species variations based on physical measurements like length and weight.
K-means is the most widely used partitioning algorithm. It starts by placing \(k\) centroids (one for each cluster), assigns every data point to the nearest centroid using Euclidean distance, and then recomputes each centroid as the mean of its assigned points, repeating until the assignments stop changing.
The Goal: To minimize the Within-cluster Sum of Squares (WSS), so that crocodiles within a group are as similar as possible.
In this Project: It serves as our baseline model to see how well the data can be divided into distinct size classes.
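As a minimal sketch of what K-means optimizes, the snippet below fits two clusters and recomputes the WSS by hand. It uses synthetic data, not the crocodile measurements, and the object names (`toy`, `fit`, `wss_manual`) are illustrative assumptions.

```r
# Illustrative synthetic data (NOT the crocodile dataset): two size classes.
set.seed(42)
toy <- rbind(
  cbind(length = rnorm(50, mean = 1, sd = 0.2), weight = rnorm(50, mean = 20,  sd = 5)),
  cbind(length = rnorm(50, mean = 4, sd = 0.3), weight = rnorm(50, mean = 300, sd = 40))
)
toy_scaled <- scale(toy)  # put both variables on a comparable scale

fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Recompute WSS by hand: squared Euclidean distance of each point to its own centroid.
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
wss_manual <- sum((toy_scaled - assigned_centers)^2)

# The hand-computed value should agree with what kmeans() reports.
all.equal(wss_manual, fit$tot.withinss)
```

The quantity `tot.withinss` is the value an elbow plot tracks as \(k\) increases.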
PAM (Partitioning Around Medoids) is a more robust relative of K-means. Instead of using the “mean” (an average point that might not actually exist), it uses medoids: actual crocodiles from the dataset that are the most central in their group.
Why it matters: K-means is sensitive to extreme outliers (like a “giant” crocodile). PAM is far less affected by such points, because each center must be an observed individual and the algorithm minimizes distances rather than squared distances.
In this Project: We use it to ensure our clusters aren’t being skewed by a few unusually large or small animals.
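The difference between a medoid and a mean can be seen in a small sketch. The lengths below are hypothetical values with one artificial “giant” added; they are not taken from the dataset, and the object names are assumptions for illustration.

```r
library(cluster)  # provides pam()

set.seed(1)
# Illustrative lengths in metres (hypothetical values), plus one extreme "giant".
len <- c(rnorm(20, mean = 1.0, sd = 0.1), rnorm(20, mean = 3.0, sd = 0.1), 12)
x <- matrix(len, ncol = 1, dimnames = list(NULL, "length"))

pam_fit <- pam(x, k = 2)
km_fit  <- kmeans(x, centers = 2, nstart = 25)

pam_fit$medoids   # each medoid is an observed value from x
pam_fit$id.med    # row indices of the medoids in x
km_fit$centers    # centroids are averages and need not match any observation
```

Comparing `pam_fit$clustering` with `km_fit$cluster` shows how each method treats the extreme value in this toy setup.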
CLARA (Clustering Large Applications) is an extension of PAM designed for high-volume data. Instead of calculating distances for the entire dataset at once (which is slow), it draws several random samples, runs PAM on each sample, and keeps the set of medoids that gives the lowest average dissimilarity when applied to the whole dataset.
The Goal: To provide the robustness of PAM while maintaining the speed required for large datasets.
In this Project: It demonstrates the scalability of our clustering approach for larger biological databases.
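A short sketch of the sampling idea, using `clara()` from the cluster package on synthetic data. The data and the choices `samples = 10` and `sampsize = 100` are assumptions made for illustration, not settings from this project.

```r
library(cluster)  # provides clara()

set.seed(7)
# Synthetic stand-in for a large dataset (hypothetical values on a common scale).
big <- rbind(
  matrix(rnorm(2000, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(2000, mean = 3, sd = 0.3), ncol = 2)
)

# samples  = number of random subsamples drawn
# sampsize = number of observations in each subsample
clara_fit <- clara(big, k = 2, samples = 10, sampsize = 100)

clara_fit$medoids            # medoids found on the best subsample
table(clara_fit$clustering)  # every row of `big` is assigned, not just the sample
```

Only the medoid search is restricted to subsamples; the final assignment covers all observations.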
In traditional clustering, a crocodile belongs to only one group. Fuzzy C-Means (FCM) introduces graded membership: every individual is assigned a weight (from 0 to 1) for every cluster, and the weights for each individual sum to 1.
Why it’s the extra piece: Biological growth is a spectrum. A crocodile might be 70% “Adult” but still retain 30% “Sub-adult” characteristics.
In this Project: FCM provides a more realistic representation of nature by identifying crocodiles that are currently in a “transitional phase” between life stages.
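To make the idea of membership weights concrete, here is a sketch of the standard FCM membership formula for fixed cluster centers, written in base R. The helper `fuzzy_membership()`, the centers, and the lengths are hypothetical; a full FCM run (for example `cmeans()` in e1071) also updates the centers iteratively, which this sketch does not do.

```r
# Membership of each point in each cluster for FIXED centers.
# m is the fuzzifier (m = 2 is a common default).
fuzzy_membership <- function(x, centers, m = 2) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  # d[i, j] = Euclidean distance from point i to center j
  d <- sapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
  })
  d[d < 1e-12] <- 1e-12     # guard against division by zero
  w <- d^(-2 / (m - 1))     # closer centers receive larger weights
  w / rowSums(w)            # normalise so each row sums to 1
}

# Hypothetical lengths (m) and two assumed centers: "Sub-adult" = 2.0, "Adult" = 3.0
lengths <- c(1.9, 2.5, 3.2)
centers <- c(2.0, 3.0)
u <- fuzzy_membership(lengths, centers)
round(u, 3)
```

A crocodile exactly halfway between the two centers (2.5 m here) receives equal weight in both groups, which is the “transitional” case described above.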
Unlike partitioning, hierarchical clustering does not require us to choose \(k\) at the start. It uses an agglomerative (bottom-up) approach, where every crocodile starts in its own cluster and the two most similar clusters are merged step by step until only one giant group remains.
The Dendrogram: The result is a tree-like diagram that shows exactly how groups are related.
In this Project: It allows us to see if there are “sub-clusters” within our main groups (e.g., separating small hatchlings from slightly larger juveniles).
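The bottom-up procedure and the nested sub-clusters can be sketched with base R. The data are synthetic and the cut levels (3 and 5 groups) are arbitrary choices for illustration.

```r
set.seed(3)
# Hypothetical scaled measurements: three loose groups of 10 individuals each.
toy <- rbind(
  matrix(rnorm(20, mean = -2, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean =  0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean =  2, sd = 0.3), ncol = 2)
)

d  <- dist(toy, method = "euclidean")  # pairwise distances
hc <- hclust(d, method = "ward.D2")    # agglomerative merging (Ward linkage)

nrow(hc$merge)  # n observations always produce n - 1 merges

# k is chosen only when cutting the finished tree
groups_3 <- cutree(hc, k = 3)
groups_5 <- cutree(hc, k = 5)  # finer cut: sub-clusters inside the 3 groups
table(groups_5, groups_3)      # each finer group sits inside one coarser group
```

Because both cuts come from the same tree, the five-group solution only subdivides the three-group solution and never reshuffles it.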
| Method | Pros | Cons |
|---|---|---|
| K-means | Fast and efficient; easy to interpret. | Sensitive to outliers; assumes spherical shapes. |
| PAM (Medoids) | Robust to outliers; uses actual data points as centers. | Slower than K-means on large data. |
| CLARA | Excellent for very large datasets via sampling. | Accuracy depends on sampling quality. |
| Fuzzy C-Means | Reflects biological reality; handles overlapping groups. | Harder to interpret; requires a fuzziness parameter. |
| Hierarchical | No need to pre-specify k; shows nested relationships. | Slow on large datasets; cannot undo merges. |
Partitioning (K-means, PAM, CLARA): We use these to find the most efficient “hard” groups for classification.
Fuzzy C-Means (The “Extra”): We use this to address the “ambiguity” in biological growth. Since crocodiles grow continuously, some individuals fall between two categories. Standard K-means would force them into one, while fuzzy clustering gives them a degree of membership in both.
Hierarchical: We use this to see the nested relationships, for example whether certain sub-adults are closer to the “Juvenile” group or the “Adult” group.
First, we load the required libraries.
# load packages
library(tidyverse)   # data manipulation (also attaches ggplot2)
library(factoextra)  # cluster visualization and Hopkins statistic
library(corrplot)    # correlation plot
library(cluster)     # pam(), clara(), silhouette()
library(fclust)      # fuzzy clustering
library(gridExtra)   # arranging several plots on one page
library(e1071)       # cmeans()
library(reshape2)    # reshaping data for plotting
Then we perform preliminary analysis and data cleaning.
| Statistic | Obs. ID | Length (m) | Weight (kg) |
|---|---|---|---|
| Minimum | 1.0 | 0.140 | 4.40 |
| 1st Quartile | 250.8 | 1.637 | 53.23 |
| Median | 500.5 | 2.430 | 100.60 |
| Mean | 500.5 | 2.415 | 155.77 |
| 3rd Quartile | 750.2 | 3.010 | 168.88 |
| Maximum | 1000.0 | 6.120 | 1139.70 |
Figure 1. Density plots of Length and Weight
The table above leads to three main conclusions:
The Growth Gradient
The summary statistics reveal a wide range of crocodile sizes, from hatchlings (0.14 m) to massive adults (6.12 m). This suggests that clustering on size is likely to capture the biological growth stages of these reptiles, not just differences between species. The fact that the Mean (2.41 m) and Median (2.43 m) for length are so close suggests a relatively balanced distribution in terms of length.
The Weight Skewness & Outliers
There is a massive disparity in the weight variable. While the median crocodile weighs 100.6 kg, the mean is 155.77 kg and the maximum reaches 1,139.7 kg. This indicates a heavy right skew in the distribution, as confirmed by the figure above. In clustering, such extreme values (outliers) can pull the centroids of a standard K-means algorithm away from the true center of the clusters. This justifies our decision to include PAM (Medoids), which is more robust to these extreme weight values.
Normalization
The scales of the two variables differ substantially: length spans single digits (0.14 to 6.12 m), while weight spans hundreds to over a thousand (4.4 to 1139.7 kg). Without normalization, the weight variable would dominate the distance calculations, making the length variable practically invisible to the algorithm. Scaling the data ensures both features contribute comparably to the final cluster assignments.
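The effect of scale on distances can be checked with a tiny example. The three crocodiles below are hypothetical values chosen for illustration, not rows from the dataset.

```r
# Three hypothetical crocodiles (illustrative values only).
crocs <- data.frame(
  length_m  = c(0.5, 2.4, 5.0),
  weight_kg = c(6, 100, 900)
)

dist(crocs)  # raw distances: driven almost entirely by weight_kg

crocs_scaled <- scale(crocs)  # subtract each column mean, divide by column sd
dist(crocs_scaled)            # both variables now contribute

round(colMeans(crocs_scaled), 10)  # columns are centred on 0
apply(crocs_scaled, 2, sd)         # and have standard deviation 1
```

In the raw data, the distance between the first two animals is about 94, almost exactly their weight difference, so their 1.9 m length difference barely registers.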
Lastly, let’s check correlation between variables.
Figure 2. Correlation between variables
As shown, there is a strong correlation between weight and length.
Based on the summary statistics, normalization is crucial for
our data. This ensures that features with larger magnitudes, such as
Weight, do not dominate features with smaller scales, like Length.
The Hopkins statistic measures the general clustering tendency (“clusterability”) of a dataset. It should be computed on normalized data: if the data is not normalized, the large variance of one variable can mask the underlying structure of the others, making the dataset appear more uniform (random) than it actually is.
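For intuition, the statistic can be sketched in base R. `hopkins_sketch()` is a hypothetical helper that follows one common formulation (values near 0.5 suggest random data, values near 1 suggest clustered data); it is not necessarily identical to the implementation used by factoextra, and the test data are synthetic.

```r
# Hopkins statistic, one common formulation (hypothetical helper).
hopkins_sketch <- function(x, m = 20) {
  x <- as.matrix(x)
  idx <- sample(nrow(x), m)

  # w: distance from m sampled real points to their nearest OTHER real point
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # m artificial points drawn uniformly from the bounding box of the data
  rng  <- apply(x, 2, range)
  fake <- sapply(seq_len(ncol(x)), function(j) runif(m, rng[1, j], rng[2, j]))

  # u: distance from each artificial point to its nearest real point
  u <- apply(fake, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  sum(u) / (sum(u) + sum(w))
}

set.seed(99)
# Synthetic data with two tight, well-separated groups
clustered <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)
h_clustered <- hopkins_sketch(clustered)
h_clustered
```

When the data are clustered, real points sit much closer to their neighbours than random points do, so the ratio moves toward 1.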
To show the importance of normalization, the two figures below compare the Hopkins statistic before and after normalization.
Figure 3. Hopkins plot for non-scaled data
Figure 4. Hopkins plot for scaled data
The two Hopkins statistics differ noticeably: 0.7890858 before normalization and 0.9092721 after. Both results nevertheless indicate a significant clustering tendency because, according to the R documentation, any value above 0.5 suggests that the data is clusterable.
To find the optimal number of clusters for K-means and PAM, we compare three criteria: average silhouette width, WSS (elbow), and the Gap statistic.
Figure 5. Optimal number of clusters using K-means
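The silhouette and elbow criteria can also be computed directly, which makes it clearer what such a plot shows. This is a minimal sketch on synthetic data with three well-separated groups; the grid `2:6` and the object names are assumptions for illustration.

```r
library(cluster)  # provides silhouette()

set.seed(11)
# Synthetic data with three well-separated groups (hypothetical values).
toy <- rbind(
  cbind(rnorm(40,  0, 0.5), rnorm(40,  0, 0.5)),
  cbind(rnorm(40, 10, 0.5), rnorm(40,  0, 0.5)),
  cbind(rnorm(40,  5, 0.5), rnorm(40, 10, 0.5))
)
d <- dist(toy)
k_grid <- 2:6

# Average silhouette width for each candidate k
avg_sil <- sapply(k_grid, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

# Total within-cluster sum of squares for each candidate k (elbow criterion)
wss <- sapply(k_grid, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

best_k <- k_grid[which.max(avg_sil)]
data.frame(k = k_grid, avg_sil = round(avg_sil, 3), wss = round(wss, 1))
```

The silhouette criterion picks the \(k\) with the highest average width, while the elbow criterion looks for the point where WSS stops dropping sharply.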
After determining the optimal number of clusters, we executed three partitioning methods: K-means, PAM, and CLARA. All three methods successfully identified three distinct segments within the crocodile dataset based on Principal Component Analysis (PCA) dimensions.
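# Fit under a name that does not shadow the kmeans() function
kmeans_fit <- kmeans(crocodile_scaled,
                     centers = k,   # number of clusters chosen above
                     nstart = 25)   # 25 random starts, keep the best solution
fviz_cluster(kmeans_fit,
             data = crocodile_scaled,
             main = "K-means")
Figure 6. K-means with 3 clusters
This method partitioned the data into three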
groups with clear boundaries. The clusters represent low, medium, and
high values across the first two principal components. K-means is
efficient for this 1,000-observation dataset but is sensitive to the
extreme weight outliers identified earlier in the analysis.
Figure 7. PAM with 3 clusters
The PAM results show a very similar structure to K-means. However, because PAM uses actual data points (medoids) as cluster centers, it is more robust to the “giant” crocodile outliers in our data. This leads to a more stable representative “profile” for each of the three life stages.
Figure 8. CLARA with 3 clusters
CLARA, designed for larger datasets, produced clusters nearly identical to PAM. This consistency validates that our data has a strong natural structure, as the sampling-based approach of CLARA reached the same conclusion as the more exhaustive PAM method.
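fuzzy <- cmeans(crocodile_scaled,
                centers = k,   # number of clusters chosen above
                m = 2)         # fuzziness parameter
# Plot the hard assignment (cluster with the highest membership per crocodile)
fviz_cluster(list(data = crocodile_scaled,
                  cluster = fuzzy$cluster),
             main = "Fuzzy Clustering - Overlapping regions")
Figure 9. C-means with 3 clusters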
Below is the membership degree table for the first six crocodiles. This illustrates the “soft” nature of the algorithm, where individuals can have partial membership in multiple groups.
| Crocodile ID | Membership Cluster 1 | Membership Cluster 2 | Membership Cluster 3 |
|---|---|---|---|
| 1 | 0.840 | 0.109 | 0.051 |
| 2 | 0.331 | 0.183 | 0.486 |
| 3 | 0.765 | 0.161 | 0.074 |
| 4 | 0.820 | 0.116 | 0.064 |
| 5 | 0.444 | 0.201 | 0.355 |
| 6 | 0.775 | 0.135 | 0.091 |
Figure 10. Membership Degrees for Fuzzy C-Means Clustering
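The membership table can be used to flag “transitional” individuals. The sketch below re-enters the six rows from the table and applies an assumed cut-off: a crocodile is treated as transitional when no cluster reaches a membership of 0.5. That threshold is an illustrative choice, not part of the original analysis.

```r
# Membership degrees copied from the table of the first six crocodiles.
membership <- matrix(
  c(0.840, 0.109, 0.051,
    0.331, 0.183, 0.486,
    0.765, 0.161, 0.074,
    0.820, 0.116, 0.064,
    0.444, 0.201, 0.355,
    0.775, 0.135, 0.091),
  ncol = 3, byrow = TRUE,
  dimnames = list(1:6, paste("Cluster", 1:3))
)

hard_label <- max.col(membership, ties.method = "first")  # cluster with the largest degree
top_degree <- apply(membership, 1, max)                   # size of that largest degree

# Assumed cut-off: "transitional" when no cluster reaches 0.5
transitional <- which(top_degree < 0.5)
transitional
```

With this cut-off, crocodiles 2 and 5 are flagged, since their largest memberships are 0.486 and 0.444.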
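# Names chosen so they do not shadow dist() or suggest a histogram
dist_matrix <- dist(crocodile_scaled,
                    method = "euclidean")   # pairwise Euclidean distances
hc_fit <- hclust(dist_matrix,
                 method = "ward.D2")        # Ward linkage
# Dendrogram
fviz_dend(hc_fit,
          k = k,          # cut the tree into k groups
          rect = TRUE,    # draw a box around each group
          main = "Hierarchical Clustering Dendrogram")
Figure 11. Hierarchical Clustering Dendrogram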
By applying a “cut” to the dendrogram at a height of approximately 20 (represented by the dashed gray boxes), we successfully produce three stable clusters. This matches the optimal \(k\) found in our previous Silhouette and Elbow analyses, providing strong cross-method validation for our results.
Biological Insights
From a biological standpoint, this dendrogram suggests that while there are three main life stages, some groups are much more similar to each other than others. For example, the red cluster appears to be more physically distinct from the other two groups, as it branches off first. The green and blue clusters share a closer branch, indicating they are more similar to each other in proportions than they are to the hatchlings.
The table below summarizes the Average Silhouette Width for each clustering method applied to the crocodile dataset. This metric serves as our primary tool for determining which algorithm produced the most distinct and cohesive groups. The score ranges from -1 to +1.
Close to +1: Indicates that the data point is very well clustered and far from neighboring clusters.
Close to 0: Indicates that the data point is on or very close to the decision boundary between two neighboring clusters (overlapping).
Negative values: Suggest that the data point might have been assigned to the wrong cluster.
| Clustering Method | Average Silhouette Width |
|---|---|
| CLARA | 0.529 |
| K-means | 0.384 |
| PAM | 0.383 |
| C-means | 0.382 |
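One point to keep in mind when comparing these values: the silhouette information stored by `clara()` in the cluster package is computed on its best sample, not on all observations. The sketch below shows how a silhouette width over the full data could be obtained for a like-for-like comparison. It uses synthetic data, and whether the CLARA value in the table was computed this way is not known.

```r
library(cluster)  # provides clara() and silhouette()

set.seed(5)
# Synthetic data with two overlapping groups (hypothetical values).
toy <- rbind(
  matrix(rnorm(600, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(600, mean = 3, sd = 1), ncol = 2)
)

sampsize <- 50
clara_fit <- clara(toy, k = 2, samples = 5, sampsize = sampsize)

# Silhouette information stored by clara(): best sample only
nrow(clara_fit$silinfo$widths)   # one row per sampled observation
clara_fit$silinfo$avg.width

# Silhouette width over ALL observations, comparable with K-means or PAM values
full_sil <- silhouette(clara_fit$clustering, dist(toy))
mean(full_sil[, "sil_width"])
```

The two averages can differ, so the full-data version is the safer number to place next to K-means, PAM, and C-means.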
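Based on the results:
While the silhouette analysis identified CLARA as the mathematically strongest partition, the Fuzzy C-Means approach is arguably the most biologically accurate. The CLARA value should be read with some care: if it was taken from the silhouette information that clara() stores, it describes only the best sample rather than all 1,000 observations, which may explain why it is so much higher than the PAM value even though the two partitions look nearly identical. In nature, crocodiles do not jump instantly from one life stage to another; their growth is a continuous gradient.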
As seen in the boxplots below, there is a significant “grey area” where the physical measurements of the clusters overlap.
Figure 12. Boxplots of Length and Weight
Why Overlap Justifies Fuzzy Clustering
The Continuum of Growth: The overlap in the “whisker” and “box” regions between Cluster 1 and Cluster 2 shows that a crocodile of ~2.5 meters could plausibly belong to either group depending on its health and age.
Handling Ambiguity: “Hard” clustering (K-means/PAM) would force these overlapping individuals into one group, potentially losing information about their transitional status.
Membership Degrees: Our Fuzzy C-Means membership matrix contains individuals whose membership is split between two groups (for example, crocodile 5 with 0.444 in Cluster 1 and 0.355 in Cluster 3), which indicates that the model is picking up these “middle-ground” crocodiles.
Conclusion: For biological classification, the ability of Fuzzy C-Means to handle these overlapping distributions makes it the better choice for reflecting the true lifecycle of the species, despite what hard silhouette scores might suggest.
This project provides a comprehensive comparison of partitioning, hierarchical, and soft clustering techniques applied to the Global Crocodile Species dataset. By synthesizing mathematical validation with biological domain knowledge, we reached a more nuanced understanding of the data than any single algorithm could provide.
Synthesizing Statistical and Biological Evidence
While the Gap Statistic suggested a lack of distinct clusters (k=1), the Elbow and Silhouette methods consistently pointed toward k=3. This numerical evidence aligns well with the three primary life stages of crocodiles: Juveniles, Sub-adults, and Mature Adults. The discrepancy in the Gap Statistic highlights a key lesson in data science: biological data often exists on a continuum of growth rather than in isolated “islands,” making it appear uniform to certain rigorous tests.
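How the Gap statistic behaves on a continuous cloud can be explored with `clusGap()` from the cluster package. The sketch below uses a hypothetical growth gradient (weight increasing smoothly with length) rather than the crocodile data; the settings `K.max = 5`, `B = 20`, and the `"firstSEmax"` selection rule are assumptions for illustration.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(21)
# Hypothetical continuous "growth gradient": one elongated cloud with no gaps.
len <- runif(150, 0.2, 6)
toy <- scale(cbind(length = len, weight = len^3 + rnorm(150, sd = 5)))

# B = number of reference datasets simulated under the "no clusters" hypothesis
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 5, B = 20,
               nstart = 10, verbose = FALSE)

# Selection rule is an assumption; maxSE() offers several alternatives
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

round(gap$Tab, 3)  # columns: logW, E.logW, gap, SE.sim
k_hat
```

The selected \(k\) depends on the rule passed to `maxSE()`, which is one reason the Gap statistic can disagree with the elbow and silhouette criteria.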
The Superiority of Fuzzy C-Means in Biological Contexts
The most significant finding was the comparative performance of “hard” versus “soft” clustering:
CLARA and K-means achieved the highest mathematical separation (Silhouette scores of 0.529 and 0.384), but they forced artificial boundaries on a continuous growth process.
Fuzzy C-Means (C-means), although its silhouette score (0.382) is lower than CLARA's and essentially tied with K-means and PAM, proved to be the most realistic model. The overlapping boxplots for length and weight are strong evidence that physical dimensions between growth stages are not mutually exclusive.
The membership degrees identified “ambiguous” individuals—crocodiles that are transitioning between stages—which would have been misclassified or simplified by hard partitioning.
In conclusion, while CLARA is the most efficient for purely statistical segmentation, Fuzzy C-Means is the superior choice for biological research. It successfully bridges the gap between the Gap Statistic’s suggestion of a “continuous cloud” and the practical need for categorical growth stages. This analysis demonstrates that meaningful data science requires a balance of robust algorithms, visual validation (Boxplots/Dendrograms), and critical interpretation of the subject matter.