Introduction

This report presents the clustering analysis of earthquake data using multiple clustering algorithms. The aim is to determine the best way to distinguish earthquakes by their features.

The analysis is divided into three main parts:

  1. Data Preprocesing - data cleanup, extraction of useful dimensions.
  2. Localization Clustering - Clustering based on latitude and longitude, performed to find out how earthquakes can be separated into groups - by the Equator, the Prime Meridian, both or by tectonic plates.
  3. Characteristics Clustering - Clustering based on depth, magnitude, and earthquake type, used to determine similar features of earthquakes.
  4. Summary

This analysis might be useful to see where on earth seismic activity is most common, which regions are particularly in danger of earthquakes and if it is possible to determine what type of earthquakes are common in those regions.

1. Data Preparation

# Load necessary libraries
library(dplyr)
library(cluster)
library(ggplot2)
library(hopkins)
library(factoextra)
library(maps)
library(ggspatial)
library(sf)
library(ClusterR)
library(purrr)
library(fastDummies)
library(fpc)

# World map with tectonic plates
world_map <- map_data("world")
tectonic_plates <- st_read("tectonic_plates.json", quiet=TRUE)

# Load the dataset
earthquakes_data <- read.csv("Significant_Earthquakes.csv")

# Check for missing values (NAs) and visualize the distribution
na_counts <- sapply(earthquakes_data, function(x) sum(is.na(x)))
data_na <- data.frame(Column = names(na_counts), NAs = as.numeric(na_counts))
ggplot(data_na, aes(x = reorder(Column, -NAs), y = NAs)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = "Missing Values per Column", x = "Column", y = "Number of Missing Values") +
    theme_minimal()

# Drop columns with high missing values and those which are not important to this analysis
columns_to_drop <- c("nst", "gap", "dmin", "rms", "horizontalError", "depthError", "magError", "magNst", "net", "id", "updated")
earthquakes_data <- earthquakes_data %>% select(-all_of(columns_to_drop))
# Keep only rows where 'type' is 'earthquake'
earthquakes_data <- earthquakes_data %>% filter(type == "earthquake")
# Drop rows with missing depth values and depth smaller than 0 (possible errors in data)
earthquakes_data <- earthquakes_data %>% filter(!is.na(depth) & depth > 0)
# Handle magType and create one-hot encoding
selected_types <- c("mb", "mw", "mwc", "mww")
earthquakes_data$magType <- ifelse(earthquakes_data$magType %in% selected_types, earthquakes_data$magType, "other")
earthquakes_data <- dummy_cols(earthquakes_data, select_columns = "magType", remove_first_dummy = TRUE)
# Recheck structure and missing values after cleaning
str(earthquakes_data)
## 'data.frame':    104401 obs. of  16 variables:
##  $ X             : int  16 17 18 19 20 22 23 24 25 26 ...
##  $ time          : chr  "1904-04-04T10:26:00.880Z" "1904-04-04T10:02:34.560Z" "1904-06-25T21:00:38.720Z" "1904-06-25T14:45:39.140Z" ...
##  $ latitude      : num  41.8 41.8 52.8 51.4 30.7 ...
##  $ longitude     : num  23.2 23.1 160.3 161.6 100.6 ...
##  $ depth         : num  15 15 30 15 15 10 10 15 20 15 ...
##  $ mag           : num  7.02 6.84 7.7 7.5 7.09 7.29 6.72 6.49 6.74 6.51 ...
##  $ magType       : chr  "mw" "mw" "mw" "mw" ...
##  $ place         : chr  "7 km SE of Stara Kresna, Bulgaria" "6 km W of Stara Kresna, Bulgaria" "115 km ESE of Petropavlovsk-Kamchatsky, Russia" "274 km SE of Petropavlovsk-Kamchatsky, Russia" ...
##  $ type          : chr  "earthquake" "earthquake" "earthquake" "earthquake" ...
##  $ status        : chr  "reviewed" "reviewed" "reviewed" "reviewed" ...
##  $ locationSource: chr  "iscgem" "iscgem" "iscgem" "iscgem" ...
##  $ magSource     : chr  "iscgem" "iscgem" "iscgem" "iscgem" ...
##  $ magType_mw    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ magType_mwc   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ magType_mww   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ magType_other : int  0 0 0 0 0 0 0 0 0 0 ...
#Create a single sample for all clustering tasks
set.seed(123)
sample_size <- min(5000, nrow(earthquakes_data))
sample_data <- earthquakes_data %>% sample_n(sample_size)

2. Localization Clustering

sample_data_loc <- sample_data %>%
    select(latitude, longitude) %>%
    mutate(across(everything(), scale))

Hopkins statistic

hopkins_stat_loc <- hopkins(sample_data_loc)
print(paste("Hopkins statistic:", hopkins_stat_loc))
## [1] "Hopkins statistic: 0.994436368834284"

Interpretation of Hopkins statistic:

Hopkins statistic is almost equal to 1 which means that dataset contains meaningful clusters.

Finding the optimal cluster count

# Elbow 
elbow <-Optimal_Clusters_KMeans(sample_data_loc, max_clusters=20, plot_clusters=TRUE)

# Silhouette 
silhouette <-Optimal_Clusters_KMeans(sample_data_loc, max_clusters=20, plot_clusters=TRUE, criterion="silhouette")

# Gapstat
set.seed(123)
gap_stat <- clusGap(sample_data_loc, FUN = kmeans, nstart = 25, K.max = 20, B = 50)
print(gap_stat)
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = sample_data_loc, FUNcluster = kmeans, K.max = 20, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 12
##           logW   E.logW       gap      SE.sim
##  [1,] 7.691328 8.116664 0.4253360 0.005045375
##  [2,] 7.314691 7.850469 0.5357784 0.004791755
##  [3,] 7.025886 7.626101 0.6002147 0.004419831
##  [4,] 6.812315 7.422632 0.6103169 0.004631921
##  [5,] 6.691382 7.329808 0.6384266 0.004130681
##  [6,] 6.579577 7.241434 0.6618573 0.004200131
##  [7,] 6.459717 7.158640 0.6989238 0.004220733
##  [8,] 6.380249 7.081403 0.7011536 0.004458368
##  [9,] 6.281604 7.012974 0.7313704 0.004883287
## [10,] 6.194378 6.964655 0.7702772 0.004503494
## [11,] 6.096089 6.918437 0.8223472 0.004071655
## [12,] 6.025524 6.873271 0.8477469 0.003880096
## [13,] 5.990945 6.830904 0.8399588 0.003377480
## [14,] 5.953003 6.790193 0.8371898 0.003030882
## [15,] 5.905460 6.753117 0.8476569 0.003116920
## [16,] 5.873038 6.718503 0.8454649 0.003339987
## [17,] 5.824149 6.688701 0.8645518 0.003478658
## [18,] 5.779484 6.660232 0.8807478 0.003594103
## [19,] 5.753474 6.632535 0.8790611 0.003499432
## [20,] 5.713590 6.606125 0.8925349 0.003236956
fviz_gap_stat(gap_stat)

4 clusters were selected (12 were checked, but did not provide any better results). In Elbow and Silhouette methods 4 clusters proved to be a good fit, in Gap Statistics 12 clusters had the best score, but 4 clusters were quite high - 2/3 of maximus score (note the values on Y axis starting from 0.4).

num_of_clusters <- 4

K-Means for localization

PAM for localization

Hierarchical Clustering for localization

Interpretation of clusters

K-means and PAM exhibited similar clustering approaches, forming clusters strongly associated with geographic structures, particularly around the equator and the Prime Meridian. This consistency suggests that both algorithms effectively capture spatial patterns related to seismic activity near major tectonic plates. These clusters indicate that geographic location (longitude and latitude) plays a significant role in analyzing earthquake locations.

In contrast, Hierarchical Clustering demonstrates a more dispersed cluster structure, with groupings appearing more closely linked to continental boundaries. For instance, the cluster dominating South America in this method encompasses areas that were previously assigned to a cluster along the western coast of North America. This difference may indicate that hierarchical clustering places greater emphasis on global distances between data points rather than purely local spatial groupings.

These variations can be attributed to differences in the algorithms themselves – K-means and PAM rely on partitioning data based on centroids or medoids, while Hierarchical Clustering incorporates various aggregation criteria, resulting in a more hierarchical division of space.

These results suggest that the choice of clustering algorithm should depend on the specific analytical goal – whether the focus is on revealing local structures (K-means, PAM) or identifying broader global patterns and their connections (Hierarchical Clustering).

Calinski-Harabasz Indexes

## [1] "Calinski-Harabasz Index for K-Means (Localization): 6887.55"
## [1] "Calinski-Harabasz Index for PAM (Localization): 6837.86"
## [1] "Calinski-Harabasz Index for Hierarchical Clustering (Localization): 5926.14"

Calinski-Harabasz Index for K-Means is the highest which suggests that K-Means approach is the best one. Worth mentioning is relatively small difference between K-Means and PAM.

Clusters presented in 2D and additional stats

## [1] "Extended statistics for localization clusters (KMEANS):"
## # A tibble: 4 × 8
##   kmeans_loc_cluster mean_latitude min_latitude max_latitude mean_longitude
##                <int>         <dbl>        <dbl>        <dbl>          <dbl>
## 1                  1         32.4          1.70        71.9           -107.
## 2                  2         -8.67       -65.7         13.9            136.
## 3                  3        -28.4        -65.7          1.66          -113.
## 4                  4         35.1         11.0         85.7            111.
## # ℹ 3 more variables: min_longitude <dbl>, max_longitude <dbl>, count <int>

## [1] "Extended statistics for localization clusters (PAM):"
## # A tibble: 4 × 8
##   pam_loc_cluster mean_latitude min_latitude max_latitude mean_longitude
##             <int>         <dbl>        <dbl>        <dbl>          <dbl>
## 1               1        -29.6        -65.7         -3.85          -115.
## 2               2         36.6         13.7         85.7            109.
## 3               3         -7.82       -65.7         16.4            135.
## 4               4         29.1         -3.86        71.9           -105.
## # ℹ 3 more variables: min_longitude <dbl>, max_longitude <dbl>, count <int>

## [1] "Extended statistics for localization clusters (Hierarchical):"
## # A tibble: 4 × 8
##   hc_loc_cluster mean_latitude min_latitude max_latitude mean_longitude
##            <int>         <dbl>        <dbl>        <dbl>          <dbl>
## 1              1        -20.9         -65.7         27.4          -107.
## 2              2         36.8          14.9         85.7           105.
## 3              3         -7.84        -65.7         23.7           135.
## 4              4         50.5          29.0         71.9          -154.
## # ℹ 3 more variables: min_longitude <dbl>, max_longitude <dbl>, count <int>

Interpretation

All of the methods produced fairly clearly defined clusters. Hierarchical clustering generally supports the findings of K-Means and PAM, indicating consistency across methods.

3. Characteristics Clustering

sample_data_char <- sample_data %>%
    select(depth, mag, starts_with("magType_")) %>%
    mutate(across(everything(), scale)) # Normalize

Hopkins statistic

hopkins_stat_char <- hopkins(sample_data_char)
print(paste("Hopkins statistic:", hopkins_stat_char))
## [1] "Hopkins statistic: 0.999997862180879"

Interpretation of Hopkins statistic:

Hopkins statistic is almost equal to 1 which means that dataset contains meaningful clusters.

Finding the optimal cluster count

# elbow
elbow <- Optimal_Clusters_KMeans(sample_data_char, max_clusters = 20, plot_clusters = TRUE)

# silhouette
silhouette <- Optimal_Clusters_KMeans(sample_data_char, max_clusters = 20, plot_clusters = TRUE, criterion = "silhouette")

# gapstat
set.seed(123)
gap_stat <- clusGap(sample_data_char, FUN = kmeans, nstart = 25, K.max = 20, B = 50)
print(gap_stat)
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = sample_data_char, FUNcluster = kmeans, K.max = 20, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 10
##           logW   E.logW       gap      SE.sim
##  [1,] 8.257268 8.949317 0.6920489 0.002439921
##  [2,] 8.062576 8.837118 0.7745417 0.002559216
##  [3,] 7.912677 8.771361 0.8586842 0.002476181
##  [4,] 7.693133 8.714649 1.0215153 0.002512616
##  [5,] 7.388069 8.674101 1.2860330 0.002737252
##  [6,] 7.200334 8.634750 1.4344158 0.002571944
##  [7,] 7.099057 8.607324 1.5082671 0.002523041
##  [8,] 7.050322 8.581552 1.5312297 0.002625742
##  [9,] 6.996471 8.563391 1.5669208 0.002647605
## [10,] 6.936539 8.546297 1.6097578 0.002600588
## [11,] 6.924242 8.530657 1.6064150 0.002612835
## [12,] 6.863725 8.515770 1.6520455 0.002745410
## [13,] 6.804482 8.502502 1.6980204 0.002532187
## [14,] 6.744947 8.489383 1.7444356 0.002615299
## [15,] 6.740198 8.476675 1.7364770 0.002634772
## [16,] 6.684308 8.464029 1.7797205 0.002471554
## [17,] 6.654029 8.454460 1.8004310 0.002414618
## [18,] 6.624396 8.445071 1.8206754 0.002575565
## [19,] 6.598554 8.435944 1.8373902 0.002630412
## [20,] 6.574375 8.426853 1.8524781 0.002590792
fviz_gap_stat(gap_stat)

7 clusters were selected. Elbow and Silhouette clearly suggested the 7 clusters, Gap statistics yet again showed higher count of clusters as preferable choice, but this time selected clusters were even closer to highest value which was proposed.

num_of_clusters <- 7

K-Means for characteristics

PAM for characteristics

Hierarchical Clustering for characteristics

Interpretation of clusters

Clustering earthquakes based on their characteristics (depth, magnitude, and magnitude type) reveals that these features are not strongly tied to specific geographic locations. This dispersion suggests that earthquakes of similar characteristics occur across various tectonic regions without strict correlation to geographic proximity. For example, two regions along the Pacific plate’s boundary may exhibit clusters with different earthquake characteristics due to localized geological variations. Similarly, clusters with similar characteristics can be found on entirely different tectonic plates, indicating that the mechanism driving earthquake characteristics may not always align with the geographic region or tectonic settings.

The overlapping of clusters near tectonic plate boundaries could also be attributed to the inherent complexity and variability of earthquake-generating processes, including subduction zones, transform faults, and spreading centers, each producing earthquakes of varying characteristics. The randomness in clustering could be due to the mixing of various geological processes influencing these features across a global scale.

This analysis highlights the importance of separating localization and characteristic clustering when studying earthquakes, as their patterns reveal different insights into the underlying geological processes. While localization clustering emphasizes spatial patterns, characteristic clustering focuses on the physical traits of seismic events, which are influenced by diverse, complex geological factors.

Calinski-Harabasz Indexes

## [1] "Calinski-Harabasz Index for K-Means (Characteristics): 5132.43"
## [1] "Calinski-Harabasz Index for PAM (Characteristics): 5101.28"
## [1] "Calinski-Harabasz Index for Hierarchical Clustering (Characteristics): 4870.59"

Calinski-Harabasz Index for K-Means is the highest which suggests that K-Means approach is the best one. Worth mentioning is relatively small difference between K-Means and PAM.

Clusters presented in 2D and additional stats

## [1] "Extended statistics for characteristics clusters (K-Means):"
## # A tibble: 7 × 8
##   kmeans_char_cluster mean_depth min_depth max_depth mean_magnitude
##                 <int>      <dbl>     <dbl>     <dbl>          <dbl>
## 1                   1       40.5      3         354.           5.38
## 2                   2       42.2      4.07      327            5.55
## 3                   3       37.3      0.7       332.           5.53
## 4                   4       33.0      7.7       330            6.42
## 5                   5       39.2      0.02      468.           5.68
## 6                   6      533.     325         686            5.40
## 7                   7       47.1      2         290.           5.17
## # ℹ 3 more variables: min_magnitude <dbl>, max_magnitude <dbl>, count <int>

## [1] "Extended statistics for characteristics clusters (PAM):"
## # A tibble: 7 × 8
##   pam_char_cluster mean_depth min_depth max_depth mean_magnitude min_magnitude
##              <int>      <dbl>     <dbl>     <dbl>          <dbl>         <dbl>
## 1                1       40.5      3         354.           5.38          5   
## 2                2       47.0      2         290.           5.17          5   
## 3                3       39.2      0.7       342.           5.48          5   
## 4                4       42.7      4.07      393.           5.55          5   
## 5                5       32.4      7.7       330            6.33          5.86
## 6                6       39.2      0.02      468.           5.68          5   
## 7                7      535.     325         686            5.39          5   
## # ℹ 2 more variables: max_magnitude <dbl>, count <int>

## [1] "Extended statistics for characteristics clusters (Hierarchical):"
## # A tibble: 7 × 8
##   hc_char_cluster mean_depth min_depth max_depth mean_magnitude min_magnitude
##             <int>      <dbl>     <dbl>     <dbl>          <dbl>         <dbl>
## 1               1       40.5      3         354.           5.38          5   
## 2               2       49.9      2         412            5.17          5   
## 3               3       36.2      0.7       284.           5.61          5   
## 4               4       40.9      4.07      270            5.55          5   
## 5               5       33.6     10         330            6.59          6.16
## 6               6       48.7      0.02      669.           5.69          5   
## 7               7      542.     282.        686            5.40          5   
## # ℹ 2 more variables: max_magnitude <dbl>, count <int>

Interpretation

The clustering methods reveal meaningful groupings in the data, distinguishing earthquakes based on their depth and magnitude characteristics. Despite slight differences in cluster boundaries across methods, the overall trends are consistent, confirming the reliability of the clustering approaches. Cluster 5 consistently represents shallow, high-magnitude earthquakes, likely indicating plate-boundary events such as transform or thrust faulting. Cluster 6, with the deepest events, may correspond to subduction zones.

One is able to see that it’s possible to cluster the eartquakes by their characterisitcs, but they are not connected (at least not in a simple way) to specific localizations.

4. Summary

This project analyzed earthquake data by clustering both their geographical locations and physical characteristics using K-Means, PAM, and Hierarchical Clustering. Location-based clustering effectively grouped earthquakes along tectonic plate boundaries, revealing distinct patterns that aligned with known geophysical structures. In contrast, characteristics-based clustering (depth, magnitude, and magnitude type) showed that earthquakes within similar regions exhibited diverse physical properties, highlighting phenomena like deep subduction events versus shallow, high-magnitude earthquakes. PCA visualizations validated the separability of the clusters, and statistical summaries provided insights into trends such as depth and magnitude distribution within each cluster. Overall, the project demonstrated how clustering techniques could uncover meaningful patterns in seismic data, contributing valuable insights for geophysical studies and disaster preparedness.