1. Introduction

This project aims to apply various clustering techniques to the Global Crocodile Species Dataset. Clustering is an unsupervised learning method used to discover hidden structures in data. In biological contexts, clustering helps identify distinct growth stages or species variations based on physical measurements like length and weight.

2. Review of the dataset

The dataset used in this project was found on Kaggle. It provides detailed information on all recognized crocodile species across the globe. The columns of the dataset are as follows:
  • Observation ID
  • Common Name
  • Scientific Name
  • Family
  • Genus
  • Observed Length (m)
  • Observed Weight (kg)
  • Age Class
  • Sex
  • Date of Observation
  • Country/Region
  • Habitat Type
  • Conservation Status
  • Observer Name
  • Notes
The dataset has many columns that could be used in various ways. For clustering, however, we focus on two primary numeric variables:
  • Observed Length (m)
  • Observed Weight (kg)

3. Clustering Methods Overview

3.1 K-Means

K-means is the most widely used partitioning algorithm. It works by defining \(k\) centroids (one for each cluster) and then assigning every data point to the nearest center using Euclidean distance.
The Goal: To minimize the Within-cluster Sum of Squares (WSS), ensuring that crocodiles within a group are as similar as possible.
In this Project: It serves as our baseline model to see how well the data can be divided into distinct size classes.
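
To make the WSS objective concrete, the short sketch below recomputes it by hand and compares it with the value reported by kmeans(). The toy data are hypothetical stand-ins for scaled length and weight, not values from the crocodile dataset.

```r
# Toy data standing in for scaled length/weight (hypothetical values)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

fit <- kmeans(x, centers = 2, nstart = 10)

# WSS by hand: squared Euclidean distance of each point to its own centroid
wss_manual <- sum((x - fit$centers[fit$cluster, ])^2)

# Should agree with the total WSS stored in the kmeans object
all.equal(wss_manual, fit$tot.withinss)
```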

3.2 PAM (Partitioning Around Medoids)

PAM is a more robust alternative to K-means. Instead of using the “mean” (an average point that might not actually exist), it uses medoids: actual crocodiles from the dataset that are the most central in their group.
Why it matters: K-means is sensitive to extreme outliers (like a “giant” crocodile). PAM is much less affected by such points, because each center must be an observed individual and the method minimizes distances rather than squared distances.
In this Project: We use it to ensure our clusters aren’t being skewed by a few unusually large or small animals.
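
The difference between a centroid and a medoid can be illustrated with a one-variable sketch. The weights below are invented for illustration; the medoid is found by brute force rather than by the actual PAM algorithm.

```r
# Hypothetical weights (kg): six ordinary animals and one very heavy one
weight <- c(80, 90, 95, 100, 110, 120, 1100)

# Centroid as used by K-means: the arithmetic mean
centroid <- mean(weight)

# Medoid as used by PAM: the observed value with the smallest
# total distance to all other observations
d <- as.matrix(dist(weight))
medoid <- weight[which.min(rowSums(d))]

centroid  # pulled toward the outlier
medoid    # still an actual, typical observation
```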

3.3 CLARA (Clustering Large Applications)

CLARA is an extension of PAM designed for high-volume data. Instead of calculating distances for the entire dataset at once (which is slow), it takes several random samples, finds the medoids for those samples, and then selects the best set for the whole population.
The Goal: To provide the robustness of PAM while maintaining the speed required for large datasets.
In this Project: It demonstrates the scalability of our clustering approach for larger biological databases.
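
As a rough illustration of the sampling idea, the sketch below runs clara() from the cluster package on simulated data. The data, the number of samples, and the sample size are all assumptions chosen for the example, not settings used in this project.

```r
library(cluster)  # clara() ships with the 'cluster' package

set.seed(42)
# Hypothetical stand-in for a large scaled dataset: 5,000 points, 3 groups
x <- rbind(matrix(rnorm(3000, mean = -3), ncol = 2),
           matrix(rnorm(3000, mean =  0), ncol = 2),
           matrix(rnorm(4000, mean =  3), ncol = 2))

# Medoids are searched on 10 random subsets of 200 rows each
fit <- clara(x, k = 3, samples = 10, sampsize = 200)

fit$medoids            # medoids are actual rows of x
table(fit$clustering)  # every row of x is assigned, not just the sampled rows
```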

3.4 Fuzzy C-Means (The “Soft” Approach)

In traditional clustering, a crocodile belongs to only one group. Fuzzy C-Means (FCM) introduces graded (“soft”) membership: every individual is assigned a membership degree between 0 and 1 for every cluster, and its degrees sum to 1 across clusters.
Why it’s the extra piece: Biological growth is a spectrum. A crocodile might be 70% “Adult” but still retain 30% “Sub-adult” characteristics.
In this Project: FCM provides a more realistic representation of nature by identifying crocodiles that are currently in a “transitional phase” between life stages.
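
The membership degrees come from the standard FCM update rule, which turns an individual's distances to the cluster centers into weights. The sketch below applies that rule to one hypothetical individual; the distances are invented, and the function assumes all distances are strictly positive.

```r
# Membership of one point in each cluster, given its distances to the
# cluster centers and the fuzziness parameter m (m > 1)
fcm_membership <- function(d, m = 2) {
  w <- d^(-2 / (m - 1))  # closer centers receive larger weights
  w / sum(w)             # normalize so the memberships sum to 1
}

# Hypothetical distances from one crocodile to three cluster centers
u <- fcm_membership(c(0.5, 1.0, 2.0))
round(u, 3)
sum(u)
```

Larger values of m flatten the memberships toward equal shares, while values close to 1 approach a hard assignment.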

3.5 Hierarchical Clustering (AGNES)

Unlike partitioning, Hierarchical clustering does not require us to choose \(k\) at the start. It uses an Agglomerative (bottom-up) approach, where every crocodile starts in its own cluster, and the most similar pairs are slowly merged until only one giant group remains.
The Dendrogram: The result is a tree-like diagram that shows exactly how groups are related.
In this Project: It allows us to see if there are “sub-clusters” within our main groups (e.g., separating small hatchlings from slightly larger juveniles).
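
A minimal sketch of the bottom-up merging, using five invented lengths and base R's hclust(). The linkage method here ("average") is chosen only for the illustration.

```r
# Five hypothetical lengths (m): two small animals, two medium, one large
len <- c(0.3, 0.4, 1.5, 1.6, 4.0)

hc <- hclust(dist(len), method = "average")

hc$merge           # order in which observations and clusters were joined
hc$height          # dissimilarity at which each merge happened
cutree(hc, k = 3)  # labels obtained by cutting the tree into 3 groups
```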

Comparison of Clustering Methods

Table 1: Pros and Cons of Selected Clustering Algorithms
| Method | Pros | Cons |
|---|---|---|
| K-means | Fast and efficient; easy to interpret. | Sensitive to outliers; assumes spherical shapes. |
| PAM (Medoids) | Robust to outliers; uses actual data points as centers. | Slower than K-means on large data. |
| CLARA | Excellent for very large datasets via sampling. | Accuracy depends on sampling quality. |
| Fuzzy C-Means | Reflects biological reality; handles overlapping groups. | Harder to interpret; requires a fuzziness parameter. |
| Hierarchical | No need to pre-specify k; shows nested relationships. | Slow on large datasets; cannot undo merges. |

Why use multiple methods?

Partitioning (K-means, PAM, CLARA): We use these to find the most efficient “hard” groups for classification.
Fuzzy C-Means (The “Extra”): We use this to address the “ambiguity” in biological growth. Since crocodiles grow continuously, some individuals fall exactly between two categories. Standard k-means would force them into one, while Fuzzy clustering gives them a percentage of membership in both.
Hierarchical: We use this to see the nested relationships—for example, if certain sub-adults are closer to the “Juvenile” group or the “Adult” group.

4. Clustering Data

4.1 Preparation

First, we load the required libraries.

# Load packages
library(tidyverse)  # data manipulation (also attaches ggplot2)
library(factoextra) # cluster visualization and get_clust_tendency()
library(corrplot)   # correlation plot
library(cluster)    # pam(), clara()
library(fclust)     # fuzzy clustering utilities
library(gridExtra)  # arranging multiple plots
library(e1071)      # cmeans()
library(reshape2)   # reshaping data

Then we perform a preliminary analysis and clean the data.

# Load data
crocodile <- read.csv("crocodile_dataset.csv")

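# Cleaning data: convert text columns to factors, then keep numeric features

crocodile <- crocodile %>% mutate(across(where(is.character), as.factor))
str(crocodile)
# Note: the numeric Observation ID column is also retained here (see Table 2)
crocodile_num <- crocodile %>% select(where(is.numeric)) %>% na.omit()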

summary(crocodile_num)

4.2 Summary Statistics of Crocodile Measurements

Table 2: Descriptive Statistics of the Crocodile Dataset
| Statistic | Obs. ID | Length (m) | Weight (kg) |
|---|---|---|---|
| Minimum | 1.0 | 0.140 | 4.40 |
| 1st Quartile | 250.8 | 1.637 | 53.23 |
| Median | 500.5 | 2.430 | 100.60 |
| Mean | 500.5 | 2.415 | 155.77 |
| 3rd Quartile | 750.2 | 3.010 | 168.88 |
| Maximum | 1000.0 | 6.120 | 1139.70 |

Figure 1. Density plots of Length and Weight

The table above leads to three main conclusions:

  1. The Growth Gradient
    The summary statistics reveal a wide range in crocodile sizes, from hatchlings (0.14 m) to massive adults (6.12 m). This suggests that our clustering approach is not just identifying different species, but is likely capturing the biological growth stages of these reptiles. The fact that the mean (2.415 m) and median (2.430 m) for length are so close suggests a relatively balanced distribution in terms of length.

  2. The Weight Skewness & Outliers
    There is a massive disparity in the weight variable. While the median crocodile weighs 100.6 kg, the maximum weight reaches 1,139.7 kg. This indicates a heavy right-skew in the distribution, as confirmed by the figure above. In clustering, such extreme values (outliers) can pull the centroids of a standard K-means algorithm away from the true center of the clusters. This justifies our decision to include PAM (Medoids), which is mathematically more robust to these extreme weight values.

  3. Normalization
    The scales of the two variables differ significantly: length values are small (0.14 to 6.12 m), while weight values run into the hundreds or thousands (4.4 to 1139.7 kg). Without normalization, the weight variable would dominate the distance calculations, making the length variable practically invisible to the algorithm. Scaling the data ensures both features contribute equally to the final cluster assignments.
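
The effect of scale on Euclidean distance can be seen in a three-animal sketch. The measurements are invented for illustration.

```r
# Three hypothetical crocodiles: columns are length (m) and weight (kg)
x <- rbind(a = c(2.0, 100), b = c(4.0, 110), c = c(2.1, 160))
colnames(x) <- c("length_m", "weight_kg")

d_raw    <- as.matrix(dist(x))         # raw units
d_scaled <- as.matrix(dist(scale(x)))  # z-scores per column

# Raw: 'a' looks far closer to 'b' than to 'c', although 'b' is twice as long,
# because only the weight difference registers
d_raw["a", c("b", "c")]

# Scaled: the length difference now counts as much as the weight difference
d_scaled["a", c("b", "c")]
```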

4.3 Correlation between variables

Lastly, let’s check correlation between variables.

Figure 2. Correlation between variables

As shown, there is a strong positive correlation between weight and length.

4.4 Normalization


Based on the summary statistics, normalization is crucial for our data. This ensures that features with larger magnitudes, such as Weight, do not dominate features with smaller scales, like Length.

crocodile_scaled <- scale(crocodile_num)

4.5 Hopkins Statistic


The Hopkins statistic measures the overall clustering tendency (“clusterability”) of a dataset. It should be computed on normalized data: if the data is not normalized, the large variance of one variable can mask the underlying structure of the others, making the dataset appear more uniform (random) than it actually is.
To show the importance of normalization, we compute the Hopkins statistic below both before and after normalization.

# Non-scaled data
get_clust_tendency(crocodile_num,
                   n = 50,
                   graph = TRUE)
# Output (non-scaled data)
## $hopkins_stat
## [1] 0.7890858

Figure 2. Hopkins plot for non-scaled data

# Scaled data
get_clust_tendency(crocodile_scaled,
                   n = 50,
                   graph = TRUE)
# Output (scaled data)
## $hopkins_stat
## [1] 0.9092721

Figure 3. Hopkins plot for scaled data

The two Hopkins statistics differ noticeably: 0.7890858 before normalization and 0.9092721 after. Both values nonetheless indicate a clear clustering tendency: according to the documentation of get_clust_tendency, a value close to 1 (well above 0.5) suggests that the data is clusterable, whereas a value near 0.5 is what uniformly distributed data would produce.
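
For intuition, the idea behind the statistic can be sketched in base R. This is one common formulation, written for illustration only; it is not claimed to match the exact computation inside get_clust_tendency, and the function name hopkins_sketch and the test data are hypothetical.

```r
# Compare nearest-neighbour distances of real points (w) with those of
# uniformly scattered artificial points (u); clustered data give u >> w
hopkins_sketch <- function(x, m = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  # m sampled observations and their nearest-neighbour distance within x
  idx <- sample(n, m)
  d_all <- as.matrix(dist(x))
  diag(d_all) <- Inf
  w <- apply(d_all[idx, , drop = FALSE], 1, min)
  # m uniform points in the bounding box and their distance to the nearest real point
  u <- replicate(m, {
    p <- apply(x, 2, function(col) runif(1, min(col), max(col)))
    min(sqrt(colSums((t(x) - p)^2)))
  })
  sum(u) / (sum(u) + sum(w))
}

set.seed(7)
# Two tight, well-separated blobs: strongly clusterable by construction
blobs <- rbind(matrix(rnorm(100, 0, 0.2), ncol = 2),
               matrix(rnorm(100, 5, 0.2), ncol = 2))
h <- hopkins_sketch(blobs)
h
```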

5. Optimal Number of Clusters

To determine the optimal number of clusters for K-means and PAM, three methods are used: the average silhouette width, WSS (elbow), and the Gap statistic.

Figure 4. Optimal number of clusters using K-means

Figure 5. Optimal number of clusters using PAM

Conclusion on Cluster Selection
Based on the visualizations above, 3 clusters were chosen as the optimal number for this project because:
  • The Silhouette Method: In both the K-means and PAM plots, the average silhouette width drops off distinctly beyond \(k=2\) and \(k=3\). While \(k=2\) shows a high score, \(k=3\) maintains a strong score while providing more granular biological insight.
  • The Elbow Method: Both plots show a clear “bend” or elbow at \(k=3\). This indicates that adding a fourth cluster does not significantly reduce the Total Within-cluster Sum of Squares (WSS), making \(k=3\) the point of diminishing returns.
  • Biological Alignment: \(k=3\) aligns well with the three primary life stages of crocodiles: Juveniles, Sub-adults, and Adults. Choosing \(k=1\) (as suggested by the Gap Statistic) would ignore the clear physical differences found in our distribution analysis.
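
The elbow criterion can be reproduced in a few lines of base R. The sketch below uses simulated data with three groups, not the crocodile data. The plotting code behind Figures 4 and 5 is not shown in this report; a helper such as factoextra's fviz_nbclust would be one way to produce such plots, which is an assumption here.

```r
set.seed(3)
# Hypothetical scaled data with three groups
x <- rbind(matrix(rnorm(60, -3, 0.5), ncol = 2),
           matrix(rnorm(60,  0, 0.5), ncol = 2),
           matrix(rnorm(60,  3, 0.5), ncol = 2))

ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Reduction in WSS gained by each additional cluster;
# the elbow is where this gain collapses
gain <- -diff(wss)
names(gain) <- paste0(ks[-length(ks)], "->", ks[-1])
round(gain, 1)
```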

6. Execution of methods

After determining the optimal number of clusters, we executed the three partitioning methods (K-means, PAM, and CLARA), followed by Fuzzy C-Means and hierarchical clustering. Each of the three partitioning methods identified three segments within the crocodile dataset; the cluster plots display them on the first two principal component (PCA) dimensions.

6.1 K-means

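k <- 3  # number of clusters chosen in Section 5
# Store the result under its own name so the kmeans() function is not masked
km_res <- kmeans(crocodile_scaled,
                 centers = k,
                 nstart = 25)
fviz_cluster(km_res,
             data = crocodile_scaled,
             main = "K-means")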

Figure 6. K-means with 3 clusters

This method partitioned the data into three groups with clear boundaries. The clusters represent low, medium, and high values across the first two principal components. K-means is efficient for this 1,000-observation dataset but is sensitive to the extreme weight outliers identified earlier in the analysis.

6.2 PAM

# Store the result under its own name so the pam() function is not masked
pam_res <- pam(crocodile_scaled,
               k)
fviz_cluster(pam_res,
             data = crocodile_scaled,
             main = "PAM")

Figure 7. PAM with 3 clusters

The PAM results show a very similar structure to K-means. However, because PAM uses actual data points (medoids) as cluster centers, it is more robust to the “giant” crocodile outliers in our data. This leads to a more stable representative “profile” for each of the three life stages.

6.3 CLARA

# Store the result under its own name so the clara() function is not masked
clara_res <- clara(crocodile_scaled,
                   k)
fviz_cluster(clara_res,
             data = crocodile_scaled,
             main = "CLARA")

Figure 8. CLARA with 3 clusters

CLARA, designed for larger datasets, produced clusters nearly identical to PAM. This consistency suggests that our data has a strong natural structure, as the sampling-based approach of CLARA reached the same conclusion as the more exhaustive PAM method.

6.4 Fuzzy C-Means

# cmeans() comes from the e1071 package; m is the fuzziness parameter
fuzzy <- cmeans(crocodile_scaled,
                centers = k,
                m = 2)
fviz_cluster(list(data = crocodile_scaled,
                  cluster = fuzzy$cluster),
             main = "Fuzzy Clustering - Overlapping regions")

Figure 9. C-means with 3 clusters

head(fuzzy$membership)

Fuzzy C-Means Membership Matrix (First 6 Observations)

Below is the membership degree table for the first six crocodiles. This illustrates the “soft” nature of the algorithm, where individuals can have partial membership in multiple groups.

Table 3: Membership Degrees for Fuzzy C-Means Clustering
| Crocodile ID | Membership Cluster 1 | Membership Cluster 2 | Membership Cluster 3 |
|---|---|---|---|
| 1 | 0.840 | 0.109 | 0.051 |
| 2 | 0.331 | 0.183 | 0.486 |
| 3 | 0.765 | 0.161 | 0.074 |
| 4 | 0.820 | 0.116 | 0.064 |
| 5 | 0.444 | 0.201 | 0.355 |
| 6 | 0.775 | 0.135 | 0.091 |

Figure 10. Membership Degrees for Fuzzy C-Means Clustering

  • Cluster Certainty: For some observations, such as ID 1 and ID 4, the algorithm shows high confidence (>80%) in assigning the individual to Cluster 1; ID 3 and ID 6 are also clearly assigned to Cluster 1 (above 75%). These represent typical “core” members of that group.
  • Biological Ambiguity: Observations like ID 2 and ID 5 show much higher levels of ambiguity. ID 2 has a 33.1% membership in Cluster 1 and a 48.6% membership in Cluster 3. In biological terms, this crocodile likely has physical traits (Length/Weight) that place it near the boundary between two life stages.
  • Advantage of Fuzzy Methods: This table illustrates why Fuzzy C-Means is well suited to biological data: it doesn’t hide the uncertainty of “transitional” individuals, which would otherwise be forced into a single category by K-means or PAM.
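
One simple way to flag such transitional individuals is to look at each row's largest membership degree. The sketch below uses the six rows of Table 3; the 0.6 cutoff is a hypothetical choice for illustration, not a threshold used in the analysis.

```r
# Membership degrees copied from Table 3
u <- matrix(c(0.840, 0.109, 0.051,
              0.331, 0.183, 0.486,
              0.765, 0.161, 0.074,
              0.820, 0.116, 0.064,
              0.444, 0.201, 0.355,
              0.775, 0.135, 0.091),
            ncol = 3, byrow = TRUE,
            dimnames = list(as.character(1:6), paste0("cluster_", 1:3)))

threshold <- 0.6  # hypothetical cutoff, not taken from the analysis

# An individual is flagged when no single cluster claims it clearly
top <- apply(u, 1, max)
ambiguous <- names(top)[top < threshold]
ambiguous
```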

6.5 Hierarchical Clustering

# Pairwise Euclidean distances on the scaled data
d <- dist(crocodile_scaled,
          method = "euclidean")
# Agglomerative clustering with Ward's method
hc_res <- hclust(d,
                 method = "ward.D2")
# Dendrogram
fviz_dend(hc_res,
          k = k,
          rect = TRUE,
          main = "Hierarchical Clustering Dendrogram")

Figure 11. Hierarchical Clustering Dendrogram

By cutting the dendrogram at a height of approximately 20 (the resulting groups are outlined by the dashed gray boxes), we obtain three clusters. This is consistent with the \(k\) chosen in our previous Silhouette and Elbow analyses, providing cross-method support for our results.

Biological Insights
From a biological standpoint, this dendrogram suggests that while there are three main life stages, some groups are much more similar to each other than others. For example, the red cluster appears to be more physically distinct from the other two groups, as it branches off first. The green and blue clusters share a closer branch, indicating they are more similar to each other in proportions than they are to the red cluster.

7. Summary & Validation

7.1 Comparing Silhouette Scores

The table below summarizes the Average Silhouette Width for each clustering method applied to the crocodile dataset. This metric serves as our primary tool for determining which algorithm produced the most distinct and cohesive groups. The score ranges from -1 to +1.
Close to +1: Indicates that the data point is very well clustered and far from neighboring clusters.
Close to 0: Indicates that the data point is on or very close to the decision boundary between two neighboring clusters (overlapping).
Negative values: Suggest that the data point might have been assigned to the wrong cluster.
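
The definition behind these ranges can be written out directly: for each observation, a is the mean distance to its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The sketch below is a plain base-R version on simulated data (the function name silhouette_widths is hypothetical). One caveat worth keeping in mind when comparing methods: in the cluster package, the silinfo component of a clara() result describes the best sample rather than the full dataset, so if the CLARA score was read from there (an assumption, since the scoring code is not shown), it may not be directly comparable with the others.

```r
# Silhouette width of every observation, given distances and cluster labels
silhouette_widths <- function(d, cl) {
  d <- as.matrix(d)
  sapply(seq_along(cl), function(i) {
    same <- cl == cl[i]
    same[i] <- FALSE
    if (!any(same)) return(0)            # singleton cluster: width set to 0
    other <- cl != cl[i]
    a <- mean(d[i, same])                          # mean distance within own cluster
    b <- min(tapply(d[i, other], cl[other], mean)) # nearest other cluster
    (b - a) / max(a, b)
  })
}

set.seed(5)
# Hypothetical data: two well-separated groups
x  <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
            matrix(rnorm(40, 3, 0.3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 10)$cluster

s <- silhouette_widths(dist(x), cl)
mean(s)  # average silhouette width of the partition
```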

Table 4: Comparative Average Silhouette Scores
| Clustering Method | Average Silhouette Width |
|---|---|
| CLARA | 0.529 |
| K-means | 0.384 |
| PAM | 0.383 |
| C-means | 0.382 |

Based on the results:

  • CLARA is the Winner (0.529): A silhouette width above 0.5 is generally interpreted as a “reasonable structure.” This suggests that CLARA’s sampling-based approach was the most effective at finding distinct, non-overlapping groups among the crocodiles.
  • K-means, PAM, and C-means (~0.38): These scores are lower, falling into the 0.26 - 0.50 range, which is often described as a “weak or moderate structure.”
  • The Biological Meaning: The fact that most methods hover around 0.38 is consistent with our earlier discussion: crocodiles grow continuously. There are no “empty gaps” between a large juvenile and a small sub-adult. These lower scores are a mathematical reflection of nature’s transition zones.
  • Consistency: The fact that K-means, PAM, and C-means have nearly identical scores (0.384, 0.383, and 0.382) shows that the three-cluster structure is very stable, even if the boundaries between the groups are “fuzzy.”

7.2 The Biological Case for Fuzzy C-Means

While the silhouette analysis identified CLARA as the mathematically strongest partition, the Fuzzy C-Means approach is arguably the most biologically accurate. In nature, crocodiles do not jump instantly from one life stage to another; their growth is a continuous gradient.

As seen in the boxplots below, there is a significant “grey area” where the physical measurements of the clusters overlap.

Figure 12. Boxplots of Length and Weight

Why Overlap Justifies Fuzzy Clustering

The Continuum of Growth: The overlap in the “whisker” and “box” regions between Cluster 1 and Cluster 2 indicates that a crocodile of ~2.5 meters could plausibly belong to either group depending on its health and age.

Handling Ambiguity: “Hard” clustering (K-means/PAM) would force these overlapping individuals into one group, potentially losing information about their transitional status.

Membership Degrees: Our Fuzzy C-Means membership matrix showed individuals with ~40-50% membership in their strongest group and substantial membership in a second one, which suggests that the model is identifying these “middle-ground” crocodiles.

Conclusion: For biological classification, the ability of Fuzzy C-Means to handle these overlapping distributions makes it the superior choice for reflecting the true lifecycle of the species, despite what hard silhouette scores might suggest.

8. Conclusion

This project provides a comprehensive comparison of partitioning, hierarchical, and soft clustering techniques applied to the Global Crocodile Species dataset. By synthesizing mathematical validation with biological domain knowledge, we reached a more nuanced understanding of the data than any single algorithm could provide.

Synthesizing Statistical and Biological Evidence
While the Gap Statistic suggested a lack of distinct clusters (k=1), the Elbow and Silhouette methods consistently pointed toward k=3. This numerical evidence aligns well with the three primary life stages of crocodiles: Juveniles, Sub-adults, and Mature Adults. The discrepancy in the Gap Statistic highlights a key lesson in data science: biological data often exists on a continuum of growth rather than in isolated “islands,” making it appear uniform to certain rigorous tests.

The Superiority of Fuzzy C-Means in Biological Contexts

The most significant finding was the comparative performance of “hard” versus “soft” clustering:

CLARA achieved the highest mathematical separation (silhouette score of 0.529), with K-means well behind (0.384), but both forced artificial boundaries on a continuous growth process.

Fuzzy C-Means (C-means), although showing a marginally lower silhouette score than K-means (0.382), proved to be the most realistic model. The overlapping boxplots for length and weight are clear evidence that physical dimensions between growth stages are not mutually exclusive.

The membership degrees identified “ambiguous” individuals—crocodiles that are transitioning between stages—which would have been misclassified or simplified by hard partitioning.

In conclusion, while CLARA produced the best-separated partition by silhouette score, Fuzzy C-Means is the superior choice for biological research. It successfully bridges the gap between the Gap Statistic’s suggestion of a “continuous cloud” and the practical need for categorical growth stages. This analysis demonstrates that meaningful data science requires a balance of robust algorithms, visual validation (Boxplots/Dendrograms), and critical interpretation of the subject matter.