First, load the necessary libraries (install them beforehand with install.packages() if they are missing). We’ll use ggplot2 for plotting and cluster for clustering evaluation.
# Load the libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.3
Assume we have a geological dataset with coordinates and some measurements (e.g., mineral composition, soil pH, etc.). For this example, we’ll create a synthetic dataset.
# Generate synthetic geological data
set.seed(123)
geo_data <- data.frame(
  Longitude = rnorm(100, mean = -100, sd = 0.5),
  Latitude = rnorm(100, mean = 40, sd = 0.5),
  Mineral_1 = rnorm(100, mean = 50, sd = 10),
  Mineral_2 = rnorm(100, mean = 30, sd = 5)
)
# View the first few rows of the dataset
head(geo_data)
## Longitude Latitude Mineral_1 Mineral_2
## 1 -100.28024 39.64480 71.98810 26.42379
## 2 -100.11509 40.12844 63.12413 26.23656
## 3 -99.22065 39.87665 47.34855 25.30731
## 4 -99.96475 39.82623 55.43194 24.73743
## 5 -99.93536 39.52419 45.85660 27.81420
## 6 -99.14247 39.97749 45.23753 31.65590
Before clustering, it’s useful to visualize the data to understand its distribution.
# Plot the data
ggplot(geo_data, aes(x = Longitude, y = Latitude)) +
  geom_point(aes(color = Mineral_1, size = Mineral_2), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Geological Data Points", x = "Longitude", y = "Latitude",
       color = "Mineral 1", size = "Mineral 2") +
  theme_minimal()
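K-means assigns points to clusters by Euclidean distance, so a feature measured on a larger scale dominates the result unless the data are standardized. Here we standardize the two mineral measurements (columns 3 and 4), which are the only features used for clustering.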
# Standardize the data
geo_data_scaled <- scale(geo_data[, 3:4])
# View the first few rows of the scaled data
head(geo_data_scaled)
## Mineral_1 Mineral_2
## [1,] 2.1880106 -0.6536692
## [2,] 1.2548418 -0.6897180
## [3,] -0.4059572 -0.8686293
## [4,] 0.4450345 -0.9783488
## [5,] -0.5630244 -0.3859683
## [6,] -0.6281979 0.3536857
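As a minimal sketch of what scale() does by default, the snippet below standardizes a small vector by hand and compares the result. The values in x are illustrative only and are not taken from the dataset above.

```r
# Illustrative values only (hypothetical, not from geo_data)
x <- c(42, 55, 61, 48, 50)

# z-score: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; convert to a plain vector to compare
z_scale <- as.numeric(scale(x))

# The two approaches should agree up to floating-point tolerance
all.equal(z_manual, z_scale)
```

After standardization each feature has mean 0 and standard deviation 1, so neither mineral measurement outweighs the other in the distance calculation.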
Use the Elbow Method, a heuristic from cluster analysis, to choose the number of clusters.
Let:
\(k\) be the number of clusters,
\(C_i\) be the set of data points assigned to cluster \(i\),
\(\mu_i\) be the centroid (mean) of the points in cluster \(i\).
The Within-Cluster Sum of Squares (WCSS) for a given \(k\) is calculated as:
\[ \text{WCSS}(k) = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} ||\mathbf{x} - \mu_i||^2 \]
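To make the formula concrete, here is a small sketch that evaluates the double sum by hand and compares it with the tot.withinss value reported by kmeans(). The matrix X is hypothetical stand-in data, not the geological dataset.

```r
set.seed(123)
# Hypothetical stand-in data: 100 points, 2 standardized features
X <- scale(matrix(rnorm(200), ncol = 2))
km <- kmeans(X, centers = 3, nstart = 25)

# Outer sum over clusters, inner sum over the points in each cluster
wcss_manual <- sum(sapply(1:3, function(i) {
  pts  <- X[km$cluster == i, , drop = FALSE]  # points in C_i
  mu_i <- colMeans(pts)                       # centroid of C_i
  sum(sweep(pts, 2, mu_i)^2)                  # squared distances to mu_i
}))
all.equal(wcss_manual, km$tot.withinss)

# For k = 1 the single centroid is the overall mean, so WCSS(1) is
# (n - 1) times the sum of the column variances
wcss_1 <- (nrow(X) - 1) * sum(apply(X, 2, var))
all.equal(wcss_1, sum(sweep(X, 2, colMeans(X))^2))
```

The k = 1 identity is why an elbow curve can be started from the total variance without running kmeans() for a single cluster.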
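# Determine the optimal number of clusters using the Elbow Method
set.seed(123)  # make the random initializations reproducible
# For k = 1, WCSS equals the total sum of squares around the overall mean
wss <- (nrow(geo_data_scaled) - 1) * sum(apply(geo_data_scaled, 2, var))
for (i in 2:10) {
  # nstart = 25 keeps the best of 25 random starts, giving a smoother curve
  wss[i] <- kmeans(geo_data_scaled, centers = i, nstart = 25)$tot.withinss
}
# Plot the Elbow Method
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters", ylab = "Total Within Sum of Squares",
     main = "Elbow Method for Determining Optimal Clusters")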
The Elbow Method operates on the following principles:
Within-Cluster Sum of Squares (WCSS):
As the number of clusters increases, the WCSS, which measures the total within-cluster variance, generally decreases. This is because each additional cluster allows the model to better fit the data points within that cluster.
Diminishing Returns:
The rate at which WCSS decreases slows down as the number of clusters increases. Initially, adding clusters significantly improves the fit, but eventually, the gains become marginal.
The “Elbow” Point:
The optimal number of clusters is typically the point where the WCSS curve starts to flatten out, resembling an “elbow.” This suggests that adding more clusters beyond this point provides diminishing returns in terms of model fit.
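Reading an elbow by eye can be subjective. One possible aid, sketched below on hypothetical stand-in data, is to tabulate how much WCSS drops with each added cluster. The names drop_abs and drop_rel are introduced here for illustration only.

```r
set.seed(123)
# Hypothetical stand-in data: 100 points, 2 standardized features
X <- scale(matrix(rnorm(200), ncol = 2))

# WCSS for k = 1..10, keeping the best of 25 random starts for each k
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

# Absolute and relative reduction gained by moving from k - 1 to k clusters
drop_abs <- -diff(wss)
drop_rel <- drop_abs / wss[-length(wss)]

data.frame(k = 2:10, wss = wss[-1], drop_abs = drop_abs, drop_rel = drop_rel)
```

A value of k after which drop_rel stays small is a candidate elbow. On data without real group structure the reductions tend to shrink gradually, and no clear elbow may appear.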
Based on the elbow plot, choose a number of clusters (e.g., 3 or 5). This example uses 3.
# Apply K-means clustering with 3 clusters
set.seed(123)
kmeans_result <- kmeans(geo_data_scaled, centers = 3, nstart = 25)
# Add cluster results to the original data
geo_data$Cluster <- as.factor(kmeans_result$cluster)
# View the first few rows of the dataset with clusters
head(geo_data)
## Longitude Latitude Mineral_1 Mineral_2 Cluster
## 1 -100.28024 39.64480 71.98810 26.42379 3
## 2 -100.11509 40.12844 63.12413 26.23656 3
## 3 -99.22065 39.87665 47.34855 25.30731 1
## 4 -99.96475 39.82623 55.43194 24.73743 3
## 5 -99.93536 39.52419 45.85660 27.81420 1
## 6 -99.14247 39.97749 45.23753 31.65590 2
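Plot the data points at their coordinates, colored and shaped by cluster assignment. Because the clusters were formed from the two mineral measurements only, they need not be spatially contiguous on the map.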
# Plot the clustering result
ggplot(geo_data, aes(x = Longitude, y = Latitude)) +
  geom_point(aes(color = Cluster, shape = Cluster), size = 3) +
  labs(title = "Geological Data Clustering", x = "Longitude", y = "Latitude", color = "Cluster") +
  theme_minimal()
K-means clustering partitions a dataset into \(k\) clusters by minimizing the within-cluster sum of squares (WCSS):
\[\begin{equation} \text{WCSS}(k) = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} ||\mathbf{x} - \mu_i||^2 \end{equation}\]
The Elbow Method helps determine the optimal \(k\) by plotting WCSS against different values of \(k\) and visually identifying the “elbow” point, where the rate of WCSS decrease slows down significantly.
The algorithm proceeds in four steps:
(1) Initialization: Choose the number of clusters, \(k\), and randomly initialize \(k\) centroids (cluster centers).
(2) Assignment: Assign each data point to the nearest centroid based on Euclidean distance.
(3) Update: Recalculate each centroid as the mean of the data points assigned to its cluster.
(4) Iteration: Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.
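The four steps can be sketched directly in R. This is a minimal illustration of the classic Lloyd-style iteration, not the routine used elsewhere in this script: kmeans() defaults to the Hartigan-Wong algorithm. The function simple_kmeans and its arguments are hypothetical names chosen for this example.

```r
# Hypothetical helper illustrating the steps; use stats::kmeans() in practice
simple_kmeans <- function(X, k, max_iter = 100, tol = 1e-8) {
  X <- as.matrix(X)
  # Step 1: pick k distinct data points as the initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid,
    # then assign each point to its nearest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    cluster <- max.col(-d2, ties.method = "first")
    # Step 3: move each centroid to the mean of its assigned points
    # (a centroid with no points keeps its previous position)
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- X[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centers[j, ] <- colMeans(members)
    }
    # Step 4: stop once the centroids have effectively stopped moving
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

set.seed(123)
# Hypothetical stand-in data: 100 points, 2 standardized features
X <- scale(matrix(rnorm(200), ncol = 2))
fit <- simple_kmeans(X, k = 3)
table(fit$cluster)
```

A single random start can settle in a poor local minimum, which is why the kmeans() call above uses nstart = 25 and keeps the best of several runs.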
Silhouette analysis assesses the quality of a clustering solution by quantifying how well each data point fits within its assigned cluster.
The silhouette coefficient \(s_i\) for a data point \(\mathbf{x}_i\) is calculated as:
\[\begin{equation} s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \end{equation}\]
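where:
\(a_i\) is the mean distance between \(\mathbf{x}_i\) and all other points in its own cluster,
\(b_i\) is the smallest mean distance between \(\mathbf{x}_i\) and the points of any other cluster (its nearest neighboring cluster).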
The silhouette coefficient ranges from -1 to 1:
\(s_i\) close to 1: The data point is well-matched to its own cluster and poorly matched to neighboring clusters (good clustering).
\(s_i\) close to 0: The data point is on or very close to the decision boundary between two neighboring clusters (ambiguous clustering).
\(s_i\) close to -1: The data point may have been assigned to the wrong cluster (poor clustering).
The overall silhouette score \(\bar{s}\) is the average of the silhouette coefficients across all data points:
\[\begin{equation} \bar{s} = \frac{1}{N} \sum_{i=1}^N s_i \end{equation}\]
Interpretation:
Higher average silhouette scores indicate better-defined and well-separated clusters.
The silhouette plot visually displays the distribution of silhouette coefficients for each cluster, helping identify potential outliers or poorly assigned points.
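As a sketch of how the coefficient is built, the snippet below computes \(s_i\) for a single point by hand and compares it with the value from silhouette() in the cluster package. X is hypothetical stand-in data, and the choice of point i = 1 is arbitrary.

```r
library(cluster)

set.seed(123)
# Hypothetical stand-in data: 100 points, 2 standardized features
X <- scale(matrix(rnorm(200), ncol = 2))
km <- kmeans(X, centers = 3, nstart = 25)
D <- as.matrix(dist(X))  # pairwise Euclidean distances

i <- 1                   # arbitrary point to inspect
own <- km$cluster[i]

# a_i: mean distance to the other points in the same cluster
same <- setdiff(which(km$cluster == own), i)
a_i <- mean(D[i, same])

# b_i: smallest mean distance to the points of any other cluster
b_i <- min(sapply(setdiff(1:3, own), function(j) mean(D[i, km$cluster == j])))

s_manual <- (b_i - a_i) / max(a_i, b_i)

# Compare with the package result; rows follow the original point order
sil <- silhouette(km$cluster, dist(X))
all.equal(s_manual, unname(sil[i, "sil_width"]))

# The overall score is the plain average of the per-point coefficients
mean(sil[, "sil_width"])
```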
Use silhouette analysis to evaluate the clustering result.
# Compute silhouette scores
sil <- silhouette(kmeans_result$cluster, dist(geo_data_scaled))
# Plot the silhouette analysis
plot(sil, main = "Silhouette Analysis for K-means Clustering", col = 2:4, border = NA)
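The average silhouette width can also serve as a second opinion on the number of clusters. The following sketch, run on hypothetical stand-in data, computes it for several values of k. The range 2:6 and the name avg_sil are assumptions made for this example.

```r
library(cluster)

set.seed(123)
# Hypothetical stand-in data: 100 points, 2 standardized features
X <- scale(matrix(rnorm(200), ncol = 2))
d <- dist(X)

ks <- 2:6  # silhouette widths need at least two clusters
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

data.frame(k = ks, avg_sil = round(avg_sil, 3))

# The k with the highest average width is one reasonable candidate
best_k <- ks[which.max(avg_sil)]
best_k
```

When the elbow plot and the silhouette widths point to different values of k, it is worth inspecting both solutions rather than trusting either heuristic alone.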
Loading Libraries: ggplot2 is used for plotting, and cluster provides functions for clustering evaluation.
Generating Synthetic Data: Creates a dataset mimicking geological measurements.
Visualizing Data: Initial scatter plot to understand data distribution.
Standardizing Data: Standardizes the measurements to ensure each feature contributes equally to the distance calculations in clustering.
Elbow Method: Helps determine the number of clusters by plotting the within-cluster sum of squares for different cluster counts.
Applying K-means: Performs K-means clustering with the chosen number of clusters.
Visualizing Clusters: Plots the clustered data to see the results of the K-means algorithm.
Evaluating Clusters: Uses silhouette analysis to evaluate how well-separated the clusters are.
This complete R script covers the basic steps needed to perform clustering on geological datasets and includes visualizations to help understand and validate the clustering results.