Geostatistics with R: Clustering part 2 (Advanced Clustering)

library(sp)          # Spatial data handling
library(gstat)       # Geostatistics
library(dbscan)      # Density-based clustering
library(factoextra)  # Cluster validation and visualization
library(cluster)     # Clustering algorithms
library(fpc)         # Cluster validation

# Note: dbscan masks stats::as.dendrogram, and fpc (loaded afterwards) masks
# dbscan::dbscan, so the DBSCAN call below uses the dbscan:: prefix.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN):

DBSCAN is a density-based clustering algorithm: it groups points that lie in densely populated regions and labels isolated points as noise. It has two parameters:

\(\epsilon\) (eps): Radius of the neighborhood around a data point.
MinPts: Minimum number of points within the \(\epsilon\)-neighborhood for a point to be considered a core point.

Every point falls into one of three categories:

Core point: Has at least MinPts points (including itself) within its \(\epsilon\)-neighborhood.
Border point: Has fewer than MinPts points within its \(\epsilon\)-neighborhood but is directly reachable from a core point.
Noise point: Neither a core point nor a border point.

DBSCAN: Well suited to finding clusters of arbitrary shape without fixing the number of clusters in advance. Because a single \(\epsilon\) applies to the whole data set, clusters with very different densities are harder to separate. Commonly used to identify mineral deposit clusters or geological formations.

# Load the meuse data set from sp (heavy-metal concentrations in floodplain topsoil)
data(meuse)
coordinates(meuse) <- ~x+y  # Convert to SpatialPointsDataFrame

# Choose DBSCAN parameters (eps: neighborhood radius in metres, minPts: minimum points)
# Call dbscan() with the namespace prefix because fpc masks it
db <- dbscan::dbscan(coordinates(meuse), eps = 200, minPts = 5)

# Visualize results: cluster 0 is noise and is drawn in black (colour 1)
plot(meuse, col = db$cluster + 1, pch = 16)
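
To make the core, border, and noise definitions concrete, here is a minimal base R sketch that labels each point of a small toy data set. The helper `classify_points()` and the toy coordinates are hypothetical and only illustrate the definitions; this is not how the dbscan package is implemented.

```r
# Hypothetical helper (not part of any package): label each point as
# "core", "border", or "noise" following the DBSCAN definitions.
classify_points <- function(xy, eps, minPts) {
  d <- as.matrix(dist(xy))                 # pairwise Euclidean distances
  within_eps <- d <= eps                   # eps-neighborhoods (each point includes itself)
  is_core <- rowSums(within_eps) >= minPts # enough neighbors to be a core point
  # Border point: not core, but at least one core point lies within eps
  near_core <- rowSums(within_eps[, is_core, drop = FALSE]) > 0
  unname(ifelse(is_core, "core", ifelse(near_core, "border", "noise")))
}

# Toy data: a tight group of five points, one point on its edge, one far away
xy <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(0.5, 0.5),
            c(2, 1),
            c(10, 10))
labels <- classify_points(xy, eps = 1.1, minPts = 4)
print(labels)
```

The point at (2, 1) has only one neighbor within eps, but that neighbor is a core point, so it is attached to the cluster as a border point; the point at (10, 10) has no neighbors and is noise.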

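Gaussian Mixture Models (GMM):

GMM: Models data as a mixture of Gaussian distributions, useful for identifying geochemical anomalies or clusters based on multiple attributes.

A GMM represents a probability distribution as a weighted sum of Gaussian distributions: \[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \] where:

\(K\): Number of mixture components.
\(\pi_k\): Mixing weight of component \(k\), with \(\pi_k \ge 0\) and \(\sum_{k=1}^{K} \pi_k = 1\).
\(\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\): Gaussian density with mean vector \(\boldsymbol{\mu}_k\) and covariance matrix \(\boldsymbol{\Sigma}_k\).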

library(mclust)

# Perform GMM clustering
gmm_model <- Mclust(meuse@data[, c("cadmium", "lead")])

# Visualize results
plot(gmm_model, what = "classification")
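
The weighted-sum formula for a GMM can be evaluated by hand in the one-dimensional case. The sketch below uses a hypothetical helper `mix_density()` and made-up weights, means, and standard deviations; it returns the mixture density and the posterior probability (responsibility) of each component, which is the quantity a soft cluster assignment is based on.

```r
# Hypothetical helper: density and component responsibilities of a 1-D Gaussian mixture
mix_density <- function(x, pi_k, mu, sigma) {
  # One column per component: pi_k * N(x | mu_k, sigma_k^2)
  comp <- sapply(seq_along(pi_k), function(k) pi_k[k] * dnorm(x, mu[k], sigma[k]))
  comp <- matrix(comp, nrow = length(x))
  list(density = rowSums(comp),                 # p(x) = sum over components
       responsibility = comp / rowSums(comp))   # posterior probability of each component
}

# Illustrative parameters (assumed, not estimated from meuse)
pi_k  <- c(0.3, 0.7)
mu    <- c(0, 5)
sigma <- c(1, 2)

res <- mix_density(c(0, 2.5, 5), pi_k, mu, sigma)
hard_class <- max.col(res$responsibility)       # most probable component per point
print(round(res$responsibility, 3))
```

Taking the component with the largest responsibility gives a hard classification, which is what `plot(gmm_model, what = "classification")` displays for the fitted model.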

Hierarchical Clustering:

Creates a hierarchical tree (dendrogram) representing the relationships between data points. You can cut the tree at a specific height to obtain clusters. Commonly used in stratigraphic correlation and mineral resource classification.

# Perform hierarchical clustering (Ward's method on Euclidean distances)
dist_matrix <- dist(meuse@data[, c("cadmium", "lead")])
hc <- hclust(dist_matrix, method = "ward.D2")

# Plot dendrogram
plot(hc, main = "Dendrogram")
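
Cutting the dendrogram is done with `cutree()` from the stats package, either by asking for a number of groups (`k`) or by giving a height (`h`). The sketch below uses synthetic data with two well-separated groups so that the result is easy to check; the data and the cut height of 5 are illustrative assumptions.

```r
set.seed(1)
# Two well-separated synthetic groups of 10 points each (illustrative, not from meuse)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

hc_toy <- hclust(dist(toy), method = "ward.D2")

groups_k <- cutree(hc_toy, k = 2)   # ask for a fixed number of clusters
groups_h <- cutree(hc_toy, h = 5)   # or cut the tree at a given height

print(table(groups_k))
```

For the meuse dendrogram the same call applies, for example `cutree(hc, k = 3)`, where the choice of three groups would be an assumption to check against the dendrogram.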

Self-Organizing Maps (SOM):

SOM: An unsupervised neural network that maps high-dimensional data onto a lower-dimensional grid while preserving topological relationships. Helpful for visualizing complex multivariate geological data and identifying patterns.

library(kohonen)

# Train SOM on standardized variables using a 5 x 5 hexagonal grid
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
som_model <- som(scale(meuse@data[, c("cadmium", "lead")]), grid = som_grid)

# Visualize the codebook vectors of each grid unit
plot(som_model, type = "codes")
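
To show what training a SOM involves, here is a minimal online-update sketch in base R. It is a simplified illustration under assumed settings (3 x 3 rectangular grid, Gaussian neighborhood, linearly decaying learning rate and radius, synthetic data) and is not the algorithm as implemented in kohonen.

```r
set.seed(42)
X <- scale(matrix(rnorm(200), ncol = 2))        # illustrative data: 100 points, 2 variables
grid <- expand.grid(gx = 1:3, gy = 1:3)         # positions of the 9 grid units
codes <- X[sample(nrow(X), nrow(grid)), ]       # initialise codebook from random observations

n_iter <- 2000
for (iter in seq_len(n_iter)) {
  alpha  <- 0.5 * (1 - iter / n_iter)           # learning rate decays linearly to 0
  radius <- 0.5 + 1.5 * (1 - iter / n_iter)     # neighborhood width shrinks over time
  x   <- X[sample(nrow(X), 1), ]                # pick one observation at random
  bmu <- which.min(colSums((t(codes) - x)^2))   # best-matching unit (closest codebook vector)
  # Squared distance on the grid between every unit and the best-matching unit
  g <- (grid$gx - grid$gx[bmu])^2 + (grid$gy - grid$gy[bmu])^2
  h <- exp(-g / (2 * radius^2))                 # Gaussian neighborhood weights
  # Move each unit towards x, more strongly the closer it is to the BMU on the grid
  codes <- codes + alpha * h * (matrix(x, nrow(codes), ncol(codes), byrow = TRUE) - codes)
}

# Map every observation to its best-matching unit
unit <- apply(X, 1, function(x) which.min(colSums((t(codes) - x)^2)))
print(table(unit))
```

Because neighboring units are pulled together during training, units that are close on the grid end up with similar codebook vectors, which is the topology-preserving property described above.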

library(EMCluster)

The emcluster function doesn’t handle missing data well. Here’s how to address it:

Check for Missing Values:

# Check for missing values in the relevant columns
any(is.na(meuse@data[, c("cadmium", "lead")]))

## [1] FALSE

If you find missing values, you have a few options:

Impute Missing Values:

Mean/Median Imputation:

Replace missing values with the mean or median of the respective column.

KNN Imputation:

Impute missing values using the k-nearest neighbors algorithm.

More Advanced Methods:

Consider multiple imputation or other sophisticated techniques depending on the nature of your data and the missingness pattern.

Here’s an example of mean imputation:

meuse_imputed <- meuse
meuse_imputed@data$cadmium[is.na(meuse_imputed@data$cadmium)] <- mean(meuse_imputed@data$cadmium, na.rm = TRUE)
meuse_imputed@data$lead[is.na(meuse_imputed@data$lead)] <- mean(meuse_imputed@data$lead, na.rm = TRUE)
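
KNN imputation can be sketched in a few lines of base R. The helper `knn_impute()` below is hypothetical and makes simplifying assumptions: the predictor columns are complete and non-constant, and a missing value is replaced by the plain mean of its `k` nearest observed neighbors. For spatial data the coordinates are a natural choice of predictors.

```r
# Hypothetical helper: fill NAs in `target` with the mean of the k nearest
# rows that have an observed value, using Euclidean distance on `predictors`.
knn_impute <- function(df, target, predictors, k = 3) {
  miss <- which(is.na(df[[target]]))
  obs  <- which(!is.na(df[[target]]))
  P <- scale(as.matrix(df[, predictors, drop = FALSE]))  # common scale for all predictors
  for (i in miss) {
    diffs <- P[obs, , drop = FALSE] -
      matrix(P[i, ], length(obs), ncol(P), byrow = TRUE)
    d <- sqrt(rowSums(diffs^2))                          # distance from row i to observed rows
    nearest <- obs[order(d)[seq_len(min(k, length(obs)))]]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

# Toy example: row 3 is missing and lies next to rows 1 and 2
toy <- data.frame(x = c(1, 2, 3, 10, 11), y = c(1, 2, 3, 10, 11),
                  cadmium = c(1.0, 1.2, NA, 8.0, 8.4))
filled <- knn_impute(toy, target = "cadmium", predictors = c("x", "y"), k = 2)
print(filled$cadmium[3])
```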

Remove Rows with Missing Values:

meuse_complete <- meuse[!is.na(meuse@data$cadmium) & !is.na(meuse@data$lead), ]

Run EMCluster on the Complete Data:

# Use the rows without missing values and standardize both variables
scaled_data <- scale(meuse_complete@data[, c("cadmium", "lead")])
em_model <- init.EM(scaled_data, nclass = 3)

# Visualize the assigned classes with plotcluster() from fpc
plotcluster(meuse_complete@data[, c("cadmium", "lead")], em_model$class)

Model-Based Clustering:

Assumes data points are generated from a mixture of underlying probability distributions, providing a probabilistic framework for cluster assignment. Often used for classifying rock types or identifying distinct groups in geochemical data.

library(EMCluster, quietly = TRUE)
set.seed(1234)
x1 <- da1$da  # Demo data set shipped with EMCluster (500 points, 2 variables)

emobj <- simple.init(x1, nclass = 10)
emobj <- shortemcluster(x1, emobj)
summary(emobj)

## Method: 
##  n = 500, p = 2, nclass = 10, flag = , total parameters = 59,
##  conv.iter = 12, conv.eps = 0.009409358,
##  logL = -5827.1582, AIC = 11772.3164, BIC = 12020.9783.
## nc: 
## [1] 10
## pi: 
##  [1] 0.07731 0.05203 0.01943 0.02477 0.19800 0.02424 0.29374 0.12548 0.02601
## [10] 0.15897

ret <- emcluster(x1, emobj, assign.class = TRUE)
summary(ret)

## Method: 
##  n = 500, p = 2, nclass = 10, flag = , total parameters = 59,
##  conv.iter = 56, conv.eps = 8.541177e-07,
##  logL = -5775.3087, AIC = 11668.6174, BIC = 11917.2793.
## nc: 
##  [1]  40  16  38  14  99  28 132  45  13  75
## pi: 
##  [1] 0.07765 0.03079 0.07566 0.02125 0.19800 0.05803 0.27991 0.08170 0.02491
## [10] 0.15211
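
The AIC and BIC reported in the second summary can be reproduced from the log-likelihood. The sketch below assumes each of the 10 components has its own mean vector and unrestricted covariance matrix; the resulting count of 59 free parameters matches the summary, which supports that assumption.

```r
# Values copied from the summary(ret) output
n    <- 500        # observations
p    <- 2          # variables
K    <- 10         # mixture components
logL <- -5775.3087

# Free parameters: (K - 1) mixing weights + K mean vectors + K covariance matrices
n_par <- (K - 1) + K * p + K * p * (p + 1) / 2

aic <- -2 * logL + 2 * n_par
bic <- -2 * logL + n_par * log(n)

print(c(n_par = n_par, AIC = aic, BIC = bic))
```

Lower values indicate a better trade-off between fit and complexity, so fitting several values of `nclass` and comparing their BIC is one way to choose the number of components.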

Data Preparation:

Standardize or normalize your data if necessary, especially for distance-based methods.
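
A short sketch of why standardization matters for distance-based methods, using made-up values on very different scales. Without scaling, the variable with the larger numeric range dominates the Euclidean distance.

```r
# Illustrative values (not taken from meuse): lead varies over a much larger range
toy <- data.frame(cadmium = c(1, 2, 3, 4),
                  lead    = c(100, 300, 200, 400))

d_raw <- dist(toy)      # distances driven almost entirely by lead

z <- scale(toy)         # centre each column to mean 0 and scale to sd 1
d_std <- dist(z)        # both variables now contribute comparably

print(round(colMeans(z), 10))
print(apply(z, 2, sd))
```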

Parameter Tuning:

Carefully choose the appropriate parameters for each algorithm (e.g., eps and minPts for DBSCAN).
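
One common heuristic for choosing eps is to look at the sorted distance from each point to its k-th nearest neighbor and pick a value near the "knee" of that curve, with k often set to minPts - 1. The base R sketch below uses a hypothetical helper `kth_nn_dist()` and toy data; the dbscan package offers `kNNdistplot()` for the same purpose.

```r
# Hypothetical helper: distance from each point to its k-th nearest neighbor
kth_nn_dist <- function(xy, k) {
  d <- as.matrix(dist(xy))
  # After sorting a row, position 1 is the point itself (distance 0),
  # so the k-th neighbor sits at position k + 1
  unname(apply(d, 1, function(row) sort(row)[k + 1]))
}

# Toy data: four evenly spaced points and one outlier
xy <- cbind(c(0, 1, 2, 3, 10), 0)
k_dist <- kth_nn_dist(xy, k = 1)
print(sort(k_dist))

# To inspect the curve for real data, plot the sorted distances:
# plot(sort(k_dist), type = "l", ylab = "k-NN distance")
```

In the toy data the sorted distances jump from 1 to 7, so any eps between those values separates the dense group from the outlier.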

Cluster Validation:

Use techniques such as silhouette analysis or the gap statistic to assess the quality of your clustering results.
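
The silhouette width compares, for each point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes it from scratch with a hypothetical helper `silhouette_width()` on toy data; in practice `silhouette()` from the cluster package does this for you.

```r
# Hypothetical helper: silhouette width of every point for a given labelling
silhouette_width <- function(x, cl) {
  cl <- as.vector(cl)
  d <- as.matrix(dist(x))
  sapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                    # convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)            # mean distance within own cluster
    b <- min(tapply(d[i, !own], cl[!own], mean))    # mean distance to nearest other cluster
    (b - a) / max(a, b)
  })
}

# Toy data: two obvious groups on a line
x <- matrix(c(0, 1, 10, 11), ncol = 1)
s_good <- silhouette_width(x, c(1, 1, 2, 2))   # sensible labelling
s_bad  <- silhouette_width(x, c(1, 2, 1, 2))   # deliberately poor labelling

print(c(good = mean(s_good), bad = mean(s_bad)))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest that points sit between clusters or are assigned to the wrong one.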