Geostatistics with R: Clustering Part 1 (K-Means)

Install and Load Necessary Libraries

First, load the necessary libraries (install them once with install.packages() if they are not already available). We’ll use ggplot2 for plotting and cluster for clustering evaluation.

# Load the libraries
library(ggplot2)

library(cluster)

Load and Prepare Geological Data

Assume we have a geological dataset with coordinates and some measurements (e.g., mineral composition, soil pH, etc.). For this example, we’ll create a synthetic dataset.

# Generate synthetic geological data
set.seed(123)
geo_data <- data.frame(
  Longitude = rnorm(100, mean = -100, sd = 0.5),
  Latitude = rnorm(100, mean = 40, sd = 0.5),
  Mineral_1 = rnorm(100, mean = 50, sd = 10),
  Mineral_2 = rnorm(100, mean = 30, sd = 5)
)

# View the first few rows of the dataset
head(geo_data)

##    Longitude Latitude Mineral_1 Mineral_2
## 1 -100.28024 39.64480  71.98810  26.42379
## 2 -100.11509 40.12844  63.12413  26.23656
## 3  -99.22065 39.87665  47.34855  25.30731
## 4  -99.96475 39.82623  55.43194  24.73743
## 5  -99.93536 39.52419  45.85660  27.81420
## 6  -99.14247 39.97749  45.23753  31.65590

Visualize the Data

Before clustering, it’s useful to visualize the data to understand its distribution.

# Plot the data
ggplot(geo_data, aes(x = Longitude, y = Latitude)) +
  geom_point(aes(color = Mineral_1, size = Mineral_2), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Geological Data Points", x = "Longitude", y = "Latitude", color = "Mineral 1", size = "Mineral 2") +
  theme_minimal()

Standardize the Data

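K-means assigns points to clusters by Euclidean distance, so a feature with a larger numeric range would dominate the result. Standardizing each feature to zero mean and unit variance lets the features contribute equally. Note that only the two mineral columns (geo_data[, 3:4]) are standardized and clustered; the coordinates are kept for plotting.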

# Standardize the data
geo_data_scaled <- scale(geo_data[, 3:4])

# View the first few rows of the scaled data
head(geo_data_scaled)

##       Mineral_1  Mineral_2
## [1,]  2.1880106 -0.6536692
## [2,]  1.2548418 -0.6897180
## [3,] -0.4059572 -0.8686293
## [4,]  0.4450345 -0.9783488
## [5,] -0.5630244 -0.3859683
## [6,] -0.6281979  0.3536857
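
As a quick check of what scale() does, the sketch below compares it with a manual z-score. It uses a stand-in data frame named minerals (an assumption for illustration; its values differ from geo_data because the random draws happen in a different order).

```r
# Stand-in for the two mineral columns (hypothetical; not identical to geo_data)
set.seed(123)
minerals <- data.frame(
  Mineral_1 = rnorm(100, mean = 50, sd = 10),
  Mineral_2 = rnorm(100, mean = 30, sd = 5)
)

scaled <- scale(minerals)

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual <- sapply(minerals, function(x) (x - mean(x)) / sd(x))

# Both approaches should agree up to floating-point error
all.equal(unname(as.matrix(scaled)), unname(manual), check.attributes = FALSE)

colMeans(scaled)      # approximately 0 for each column
apply(scaled, 2, sd)  # 1 for each column
```

The column means and standard deviations that scale() used are stored in the attributes "scaled:center" and "scaled:scale" of the result.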

Determine the Optimal Number of Clusters

Use the Elbow Method to determine the optimal number of clusters.

The Elbow Method is a heuristic approach used in cluster analysis to determine the optimal number of clusters in a dataset.

Let:

\(k\) be the number of clusters.
\(C_i\) be the set of data points assigned to cluster \(i\).
\(\mu_i\) be the centroid (mean) of cluster \(i\).

The Within-Cluster Sum of Squares (WCSS) for a given \(k\) is calculated as:

\[ \text{WCSS}(k) = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} ||\mathbf{x} - \mu_i||^2 \]

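# Determine the optimal number of clusters using the Elbow Method
set.seed(123)  # make the random starts reproducible
# WCSS for k = 1 is the total sum of squares
wss <- (nrow(geo_data_scaled) - 1) * sum(apply(geo_data_scaled, 2, var))
for (i in 2:10) {
  # nstart = 25 keeps the best of 25 random starts, which smooths the curve
  wss[i] <- kmeans(geo_data_scaled, centers = i, nstart = 25)$tot.withinss
}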

# Plot the Elbow Method
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters", ylab = "Total Within Sum of Squares",
     main = "Elbow Method for Determining Optimal Clusters")

The Elbow Method operates on the following principles:

Within-Cluster Sum of Squares (WCSS):

As the number of clusters increases, the WCSS, which measures the total within-cluster variance, generally decreases. This is because each additional cluster allows the model to better fit the data points within that cluster.

Diminishing Returns:

The rate at which WCSS decreases slows down as the number of clusters increases. Initially, adding clusters significantly improves the fit, but eventually, the gains become marginal.

The “Elbow” Point:

The optimal number of clusters is typically the point where the WCSS curve starts to flatten out, resembling an “elbow.” This suggests that adding more clusters beyond this point provides diminishing returns in terms of model fit.
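
The diminishing returns described above can be put into numbers by tabulating how much WCSS drops with each added cluster. This is a minimal sketch on stand-in standardized data (the matrix x is an assumption, not the tutorial's geo_data_scaled); reading an elbow from such a table is still a judgment call.

```r
# Stand-in standardized data (hypothetical; not identical to geo_data_scaled)
set.seed(123)
x <- scale(cbind(rnorm(100, 50, 10), rnorm(100, 30, 5)))

k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Absolute and relative drop in WCSS when going from k - 1 to k clusters
drop <- -diff(wss)
rel_drop <- drop / wss[-length(wss)]

data.frame(
  k = k_values[-1],
  wss = round(wss[-1], 2),
  drop = round(drop, 2),
  rel_drop = round(rel_drop, 3)
)
```

A candidate elbow is the last k before the relative drop becomes small and roughly constant.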

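Based on the elbow plot, choose a number of clusters; this example uses 3. Because each synthetic variable is drawn from a single normal distribution, the elbow may not be sharp here; data with genuinely distinct groups usually shows a clearer bend.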

# Apply K-means clustering with 3 clusters
set.seed(123)
kmeans_result <- kmeans(geo_data_scaled, centers = 3, nstart = 25)

# Add cluster results to the original data
geo_data$Cluster <- as.factor(kmeans_result$cluster)

# View the first few rows of the dataset with clusters
head(geo_data)

##    Longitude Latitude Mineral_1 Mineral_2 Cluster
## 1 -100.28024 39.64480  71.98810  26.42379       3
## 2 -100.11509 40.12844  63.12413  26.23656       3
## 3  -99.22065 39.87665  47.34855  25.30731       1
## 4  -99.96475 39.82623  55.43194  24.73743       3
## 5  -99.93536 39.52419  45.85660  27.81420       1
## 6  -99.14247 39.97749  45.23753  31.65590       2
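
Because the clustering ran on standardized values, the cluster centers returned by kmeans() are in z-score units. The sketch below converts them back to the original measurement units, using a stand-in data frame named minerals (an assumption for illustration), and cross-checks the result against per-cluster means.

```r
# Stand-in for the two mineral columns (hypothetical; not identical to geo_data)
set.seed(123)
minerals <- data.frame(
  Mineral_1 = rnorm(100, mean = 50, sd = 10),
  Mineral_2 = rnorm(100, mean = 30, sd = 5)
)
scaled <- scale(minerals)
fit <- kmeans(scaled, centers = 3, nstart = 25)

# Undo the z-score: multiply by the column sd, then add the column mean
centers_original <- sweep(fit$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2, attr(scaled, "scaled:center"), "+")
centers_original

# Cross-check: per-cluster means of the unscaled data should match
cluster_means <- aggregate(minerals, by = list(Cluster = fit$cluster), FUN = mean)
cluster_means
```

Centers in original units are usually easier to interpret, for example as typical mineral concentrations per cluster.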

Visualize the Clustering Result

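Plot the data points at their coordinates, colored by cluster assignment. Because the clustering used only the mineral measurements and the synthetic coordinates are generated independently of them, no spatial pattern should be expected in this plot.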

# Plot the clustering result
ggplot(geo_data, aes(x = Longitude, y = Latitude)) +
  geom_point(aes(color = Cluster, shape = Cluster), size = 3) +
  labs(title = "Geological Data Clustering", x = "Longitude", y = "Latitude", color = "Cluster") +
  theme_minimal()

K-means clustering partitions a dataset into \(k\) clusters by minimizing the within-cluster sum of squares (WCSS):

\[ \text{WCSS}(k) = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} ||\mathbf{x} - \mu_i||^2 \]

The Elbow Method helps determine the optimal \(k\) by plotting WCSS against different values of \(k\) and visually identifying the “elbow” point, where the rate of WCSS decrease slows down significantly.
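
To connect the WCSS formula to the kmeans() output, the sketch below computes the sum by hand and compares it with the tot.withinss component. The matrix x is stand-in standardized data (an assumption, not the tutorial's geo_data_scaled).

```r
# Stand-in standardized data (hypothetical; not identical to geo_data_scaled)
set.seed(123)
x <- scale(cbind(
  Mineral_1 = rnorm(100, mean = 50, sd = 10),
  Mineral_2 = rnorm(100, mean = 30, sd = 5)
))

fit <- kmeans(x, centers = 3, nstart = 25)

# Sum of squared Euclidean distances from each point to its assigned centroid;
# fit$centers[fit$cluster, ] repeats the matching centroid row for every point
wcss_manual <- sum((x - fit$centers[fit$cluster, ])^2)

# Should match the value kmeans() reports
all.equal(wcss_manual, fit$tot.withinss)
```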

Algorithm Steps

(1) Initialization: Choose the number of clusters, \(k\), and randomly initialize \(k\) centroids (cluster centers).
(2) Assignment: Assign each data point to the nearest centroid based on Euclidean distance.
(3) Update: Recalculate the centroids as the mean of the data points assigned to each cluster.
(4) Iteration: Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.
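
The four steps above can be sketched as a short function. This is a didactic version of the classic (Lloyd-style) iteration, not what kmeans() runs by default: R's kmeans() uses the Hartigan-Wong algorithm unless told otherwise. The function name simple_kmeans and its arguments are hypothetical.

```r
# A didactic Lloyd-style K-means; simple_kmeans is a hypothetical helper name
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # (1) Initialization: pick k distinct data points as starting centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # (2) Assignment: squared Euclidean distance from every point to every centroid
    d2 <- vapply(
      seq_len(k),
      function(j) rowSums(sweep(x, 2, centers[j, ])^2),
      numeric(nrow(x))
    )
    cluster <- max.col(-d2, ties.method = "first")  # index of the nearest centroid
    # (3) Update: each centroid becomes the mean of its assigned points
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) new_centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # (4) Iteration: stop once the centroids have effectively stopped moving
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on stand-in standardized data (hypothetical; not the tutorial's dataset)
set.seed(123)
x <- scale(cbind(rnorm(100, 50, 10), rnorm(100, 30, 5)))
fit <- simple_kmeans(x, k = 3)
table(fit$cluster)
```

A single run like this can settle in a poor local optimum, which is why the tutorial calls kmeans() with nstart = 25 and keeps the best of several random starts.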

Silhouette analysis

Silhouette analysis assesses the quality of a clustering solution by quantifying how well each data point fits within its assigned cluster.

The silhouette coefficient \(s_i\) for a data point \(\mathbf{x}_i\) is calculated as:

\[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \]

where:

\(a_i\) is the average distance between \(\mathbf{x}_i\) and all other data points within the same cluster.
\(b_i\) is the smallest average distance from \(\mathbf{x}_i\) to the data points of any other cluster, that is, the average distance to the nearest neighboring cluster.

The silhouette coefficient ranges from -1 to 1:

\(s_i\) close to 1: The data point is well-matched to its own cluster and poorly matched to neighboring clusters (good clustering).

\(s_i\) close to 0: The data point is on or very close to the decision boundary between two neighboring clusters (ambiguous clustering).

\(s_i\) close to -1: The data point may have been assigned to the wrong cluster (poor clustering).

The overall silhouette score \(\bar{s}\) is the average of the silhouette coefficients across all data points:

\[ \bar{s} = \frac{1}{N} \sum_{i=1}^N s_i \]

Interpretation

Higher average silhouette scores indicate better-defined and well-separated clusters.

The silhouette plot visually displays the distribution of silhouette coefficients for each cluster, helping identify potential outliers or poorly assigned points.
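
To make the definitions of \(a_i\), \(b_i\) and \(s_i\) concrete, the sketch below computes the coefficients by hand and compares them with silhouette() from the cluster package. It runs on stand-in standardized data (an assumption) and assumes every cluster contains at least two points; the helper name silhouette_by_hand is hypothetical.

```r
library(cluster)

# Stand-in standardized data (hypothetical; not identical to geo_data_scaled)
set.seed(123)
x <- scale(cbind(rnorm(100, 50, 10), rnorm(100, 30, 5)))
cl <- kmeans(x, centers = 3, nstart = 25)$cluster
d <- as.matrix(dist(x))  # full matrix of pairwise Euclidean distances

silhouette_by_hand <- function(i) {
  same <- cl == cl[i]
  same[i] <- FALSE                 # exclude the point itself from a_i
  a <- mean(d[i, same])            # a_i: mean distance within its own cluster
  other <- cl != cl[i]
  # b_i: smallest mean distance to the points of any other cluster
  b <- min(tapply(d[i, other], cl[other], mean))
  (b - a) / max(a, b)
}

s_manual <- vapply(seq_len(nrow(x)), silhouette_by_hand, numeric(1))
s_pkg <- silhouette(cl, dist(x))[, "sil_width"]

all.equal(s_manual, as.numeric(s_pkg))  # the two computations should agree
mean(s_manual)                          # overall silhouette score
```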

Evaluate the Clustering Result

Use silhouette analysis to evaluate the clustering result.

# Compute silhouette scores
sil <- silhouette(kmeans_result$cluster, dist(geo_data_scaled))

# Plot the silhouette analysis
plot(sil, main = "Silhouette Analysis for K-means Clustering", col = 2:4, border = NA)
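
The average silhouette width can also serve as a second opinion on the number of clusters chosen with the Elbow Method. The sketch below computes it for several values of k on stand-in standardized data (an assumption); it is one common heuristic, not a definitive selection rule.

```r
library(cluster)

# Stand-in standardized data (hypothetical; not identical to geo_data_scaled)
set.seed(123)
x <- scale(cbind(rnorm(100, 50, 10), rnorm(100, 30, 5)))
d <- dist(x)

# The silhouette needs at least two clusters, so start at k = 2
k_values <- 2:10
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

data.frame(k = k_values, avg_sil = round(avg_sil, 3))
k_values[which.max(avg_sil)]  # k with the highest average silhouette width
```

If the elbow plot and the average silhouette width point to different values of k, it is worth inspecting both clusterings before settling on one.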

Explanation of Each Step

Loading Libraries:

ggplot2 is used for plotting, and cluster provides functions for clustering evaluation.

Generating Synthetic Data:

Creates a dataset mimicking geological measurements.

Visualizing Data:

Initial scatter plot to understand data distribution.

Standardizing Data:

Standardizes the measurements to ensure each feature contributes equally to the distance calculations in clustering.

Elbow Method:

Helps determine the number of clusters by plotting the within-cluster sum of squares for different cluster counts.

Applying K-means:

Performs K-means clustering with the chosen number of clusters.

Visualizing Clusters:

Plots the clustered data to see the results of the K-means algorithm.

Evaluating Clusters:

Uses silhouette analysis to evaluate how well-separated the clusters are.

This complete R script covers the basic steps needed to perform clustering on geological datasets and includes visualizations to help understand and validate the clustering results.