Unsupervised Learning Clustering Assignment

Si Tang Lin

Introduction

On April 3, 2024, an earthquake of magnitude 7.2 on the Richter scale struck eastern Taiwan, caused by fault movement (the USGS catalog used below records the main shock at magnitude 7.4). The shaking lasted about 98 seconds. Aftershocks continued over the following days, causing major roads in the eastern region to collapse and damaging bridges and elevated rail lines to varying degrees. During this period Taiwan issued two national-level alerts, and the earthquake caused 18 fatalities and thousands of injuries. To analyze the locations, magnitudes, and depths of the earthquakes recorded from April 2 to April 7, I use clustering methods.

Prerequisites

# Load required libraries
library(ggplot2) 
library(tidyverse) 
library(factoextra) 
library(cluster) 
library(wesanderson)
library(dbscan)
library(plotly)
library(gridExtra)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(tmap)

Load Dataset and Key Insight

Data source: https://earthquake.usgs.gov/earthquakes/search/

The data come from the USGS earthquake catalog. I selected the earthquakes that occurred in Taiwan from April 2 to April 7.

query<-read.csv("/Users/ninalin/Desktop/query.csv")
view(query)
str(query)
## 'data.frame':    76 obs. of  5 variables:
##  $ time     : chr  "2024-04-07T22:49:30.435Z" "2024-04-07T10:24:38.426Z" "2024-04-07T02:06:00.792Z" "2024-04-06T21:15:08.953Z" ...
##  $ latitude : num  24.2 24.1 24.2 24 24.2 ...
##  $ longitude: num  122 122 122 122 122 ...
##  $ depth    : num  37.5 15.6 16 18.9 33.8 ...
##  $ mag      : num  4.6 4.5 4.5 4.6 4.8 5.2 4.7 4.9 4.5 4.9 ...
summary(query)
##      time              latitude       longitude         depth       
##  Length:76          Min.   :23.73   Min.   :121.4   Min.   : 6.723  
##  Class :character   1st Qu.:23.88   1st Qu.:121.7   1st Qu.:16.843  
##  Mode  :character   Median :24.04   Median :121.7   Median :26.235  
##                     Mean   :24.02   Mean   :121.7   Mean   :25.274  
##                     3rd Qu.:24.17   3rd Qu.:121.8   3rd Qu.:35.000  
##                     Max.   :24.29   Max.   :121.9   Max.   :45.326  
##       mag       
##  Min.   :4.500  
##  1st Qu.:4.500  
##  Median :4.750  
##  Mean   :4.847  
##  3rd Qu.:5.000  
##  Max.   :7.400
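
The `time` column is read as plain text. As a minimal sketch, assuming the ISO 8601 format shown in the `str()` output above, the timestamps could be parsed and counted per day as follows. The `times` vector is a small sample copied from that output, not the full dataset, and `events_per_day` is a name introduced here for illustration.

```r
# Sample timestamps copied from the str() output above (not the full data).
times <- c("2024-04-07T22:49:30.435Z",
           "2024-04-07T10:24:38.426Z",
           "2024-04-07T02:06:00.792Z",
           "2024-04-06T21:15:08.953Z")

# Parse as UTC; %OS reads the seconds including the fractional part.
parsed <- as.POSIXct(times, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Count events per calendar day (UTC).
events_per_day <- table(as.Date(parsed, tz = "UTC"))
print(events_per_day)
```

Applied to `query$time`, the same two steps would show how the events are spread across the study period.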

Earthquake Location Analysis

Cluster the earthquake locations based on latitude and longitude.
set.seed(123)
loc_data <- scale(query[, c("latitude", "longitude")])

loc_elbow<-fviz_nbclust(loc_data, kmeans, method = "wss") +ggtitle("Elbow Method")
loc_silhouette<- fviz_nbclust(loc_data, kmeans, method = "silhouette") +ggtitle("Silhouette Method")

grid.arrange(loc_elbow,loc_silhouette, ncol=2, top="Optimal Number of Clusters")

I used both the elbow method and the silhouette method to find the optimal number of clusters. The graphs indicate that the optimal number of clusters is 4.
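
The elbow curve is the total within-cluster sum of squares (WSS) plotted against the number of clusters. The following is a minimal base-R sketch of that quantity, assuming a hypothetical `toy` matrix of three well-separated groups in place of the scaled earthquake coordinates.

```r
set.seed(123)
# Hypothetical stand-in for the scaled latitude/longitude matrix:
# three well-separated 2-D groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.2), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The elbow is where adding another cluster stops reducing WSS much.
print(round(wss, 2))
```

In this toy case the drop in WSS is large up to k = 3 and small afterwards, which is the pattern to look for in the elbow plot.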

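set.seed(123)
location_data <- scale(query[, c("latitude", "longitude")])
kmeans_result2 <- kmeans(location_data, centers = 4)
# Store the labels as a factor so that plots treat them as discrete groups
query$cluster2 <- as.factor(kmeans_result2$cluster)

# eclust() runs its own k-means on longitude/latitude for the cluster plot
km_location <- eclust(query[, c("longitude", "latitude")], k = 4, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)

# The columns are (longitude, latitude), so longitude is on the x-axis
fviz_cluster(km_location, data = query[, c("longitude", "latitude")], geom = "point", ellipse.type = "convex") + labs(title = "Earthquake Location with K-means", x = "Longitude", y = "Latitude") + theme_minimal()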
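
Because k-means is run on scaled coordinates, its cluster centers are in standardized units. The sketch below shows one way to convert them back to degrees using the attributes that `scale()` stores. The `coords` data frame is hypothetical, drawn at random within the latitude and longitude ranges from the summary above.

```r
set.seed(123)
# Hypothetical coordinates within the ranges shown in the data summary.
coords <- data.frame(
  latitude  = runif(30, 23.73, 24.29),
  longitude = runif(30, 121.4, 121.9)
)

scaled <- scale(coords)
fit <- kmeans(scaled, centers = 4, nstart = 25)

# Undo scale(): multiply by the column SDs, then add the column means.
centers_deg <- sweep(fit$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_deg <- sweep(centers_deg, 2, attr(scaled, "scaled:center"), "+")
print(round(centers_deg, 3))
```

The back-transformed centers equal the per-cluster means of the original coordinates, so they can be read directly as map positions.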

Earthquake Location Analysis on Real World Map

world_data <- ne_countries(scale = "medium", returnclass = "sf") 
taiwan <- world_data %>%
  filter(name == "Taiwan") 

ggplot(data = taiwan) +
  geom_sf(fill = "gray", color = "black") +
  # factor() makes the cluster labels a discrete color scale
  geom_point(data = query, aes(x = longitude, y = latitude, color = factor(cluster2)), size = 2) +
  ggtitle("Earthquake Clustering on Taiwan Map") +
  labs(x = "Longitude", y = "Latitude", color = "Cluster")

taiwan <- ne_countries(scale = "medium", returnclass = "sf") %>%
  dplyr::filter(name == "Taiwan")

tmap_mode("view")
## tmap mode set to interactive viewing
query_sf <- st_as_sf(query, coords = c("longitude", "latitude"), crs = 4326)
# Treat the cluster labels as categories so each cluster gets one palette color
query_sf$cluster2 <- as.factor(query_sf$cluster2)
tm_shape(taiwan) +
  tm_polygons(col = "lightblue", border.col = "black") +
  tm_shape(query_sf) +
  tm_dots(col = "cluster2", palette = wes_palette("GrandBudapest2", 4), size = 0.1) +
  tm_layout(title = "K-means clustering on Taiwan Map")

The maps above show the clustering results on a real-world map.

The clusters show where most of the earthquakes occurred: they are concentrated along Taiwan’s eastern coastline and the adjacent offshore area. K-means clustering separates these events into distinct regions of high seismic activity, which gives insight into the spatial distribution of the sequence and can help target efforts to limit further damage.

Magnitude and Depth Analysis

Magnitude

scaled_magnitude<- scale(query$mag)
fviz_nbclust(as.data.frame(scaled_magnitude), kmeans, method =
"wss")+labs(title= "Elbow Method For Optimal K for Magnitude")

Set K to the optimal number of clusters.

set.seed(123)
k<-3
kmeans_result<-kmeans(scaled_magnitude, centers = k, nstart =25)
query$magnitude_cluster<-as.factor(kmeans_result$cluster)

# Pass the palette vector directly; base palette() would return the previously active palette instead
ggplot(query, aes(x = mag, y = 0, color = magnitude_cluster)) + geom_jitter(width = 0.1, height = 0.1, size = 2, alpha = 0.8) + scale_color_manual(values = wes_palette("GrandBudapest2", 4))

table(kmeans_result$cluster) # check the cluster sizes
## 
##  1  2  3 
## 45 29  2

A total of 45 earthquakes below magnitude 4.9 form the pink cluster. Another 29 earthquakes, with magnitudes ranging from 4.9 to 5.7, make up the purple cluster. Finally, two major earthquakes with magnitudes of 6.4 and 7.4 respectively, represent the brown cluster.

silhouette_magnitude<- silhouette(kmeans_result$cluster, dist(scaled_magnitude))

fviz_silhouette(silhouette_magnitude)+labs(title="Silhouette Plot of Magnitude Performance", x="Cluster", y="Silhouette Width")
##   cluster size ave.sil.width
## 1       1   45          0.72
## 2       2   29          0.52
## 3       3    2          0.40

Visualize the silhouette statistics to evaluate the clustering performance.

Cluster 1 reaches an average silhouette width of 0.72, which indicates a tightly grouped, well-separated cluster. Cluster 2 contains overlapping and boundary points, so it reaches only 0.52. Cluster 3 is affected by the outlier (the main shock), so its average silhouette width is 0.40.
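
To make the silhouette width concrete: for a point i it is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes it by hand in base R. The `mag` and `cl` vectors are hypothetical values chosen for illustration; only 6.4 and 7.4 mirror the two large events described above.

```r
# Hypothetical magnitudes and cluster labels for illustration.
mag <- c(4.5, 4.6, 4.7, 5.0, 5.2, 5.4, 6.4, 7.4)
cl  <- c(1,   1,   1,   2,   2,   2,   3,   3)

d <- as.matrix(dist(mag))

sil_width <- function(i) {
  same <- cl == cl[i]
  same[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, same])                 # mean distance within own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

s <- sapply(seq_along(mag), sil_width)

# Average silhouette width per cluster.
print(round(tapply(s, cl, mean), 2))
```

In this toy example the 6.4 event is almost as close to the middle cluster (mean distance 1.2) as to its only cluster mate (distance 1.0), so its width is only about 0.17. This is the same mechanism that lowers the average for a two-point cluster containing an extreme event.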

Depth

scaled_depth<- scale(query$depth)
fviz_nbclust(as.data.frame(scaled_depth), kmeans, method="wss")+labs(title= "Elbow Method For Depth")

set.seed(123)
k<-2
kmeans_result_depth<-kmeans(scaled_depth, centers = k, nstart = 25)
query$depth_cluster<-as.factor(kmeans_result_depth$cluster)
# Pass the palette vector directly; base palette() would return the previously active palette instead
ggplot(query, aes(x = depth, y = 0, color = depth_cluster)) +
  geom_jitter(width = 0.1, height = 0.1, size = 2, alpha = 0.8) +
  scale_color_manual(values = wes_palette("GrandBudapest2", 2)) +
  labs(title = "K-means Clustering of Earthquake Depth", x = "Depth", y = "Clusters", color = "Cluster")

There are two clusters identified in the depth data.

Cluster 1 (pink Cluster) comprises 33 data points, representing shallow to intermediate-depth earthquakes with focal depths ranging from 6.7 to 24.9 km. On the other hand, Cluster 2 (purple Cluster) includes 43 data points, corresponding to intermediate-depth earthquakes with focal depths between 25.0 and 45.3 km.
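
Cluster sizes and depth ranges like the ones quoted above can be tabulated directly from the labels. This is a minimal sketch using hypothetical `depth` and `cluster` vectors. With the real data, `query$depth` and `query$depth_cluster` would take their place.

```r
# Hypothetical depths (km) and cluster labels for illustration only.
depth   <- c(6.7, 12.3, 18.9, 24.9, 25.0, 33.8, 37.5, 45.3)
cluster <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))

# Size, minimum and maximum depth per cluster.
depth_summary <- data.frame(
  cluster = levels(cluster),
  n       = as.vector(table(cluster)),
  min     = as.vector(tapply(depth, cluster, min)),
  max     = as.vector(tapply(depth, cluster, max))
)
print(depth_summary)
```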

silhouette_depth<-silhouette(kmeans_result_depth$cluster, dist(scaled_depth))
fviz_silhouette(silhouette_depth)+labs(title="Silhouette Plot of Depth Performance", x="Cluster", y="Silhouette Width")
##   cluster size ave.sil.width
## 1       1   33          0.65
## 2       2   43          0.65

Visualize the silhouette statistics to evaluate the clustering performance.

Both clusters reach an average silhouette width of 0.65, which indicates moderate performance: the points are grouped consistently and the two clusters are clearly separated from each other.

Depth and Magnitude Analysis

The earlier results show that the data are concentrated in the lower magnitude range. I therefore chose DBSCAN for the combined analysis of magnitude and depth. Because the distribution of the earthquakes is highly uneven and the dataset includes extreme magnitudes, a density-based method is better suited here: it groups the dense regions and labels isolated events as noise instead of forcing them into a cluster.

depth_magnitude_data<-query[, c("depth", "mag")]
scaled_data<-scale(depth_magnitude_data)
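# k = minPts - 1 = 4 for the minPts = 5 used below
kNNdistplot(scaled_data, k = 4)
# Reference line at the eps value passed to dbscan() below
abline(h = 0.43, col = "red", lty = 2)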
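
The k-nearest-neighbor distance behind this kind of plot can also be computed by hand, which makes it easier to see what the knee represents. The sketch below uses base R only and a hypothetical `toy` matrix in place of `scaled_data`. It is an illustration of the idea and not necessarily identical to what `kNNdistplot()` draws in every version of the dbscan package.

```r
set.seed(123)
# Hypothetical scaled depth/magnitude matrix standing in for scaled_data.
toy <- scale(cbind(depth = runif(50, 6, 45), mag = runif(50, 4.5, 5.5)))

k <- 4                                  # minPts - 1 when minPts = 5
d <- as.matrix(dist(toy))

# For each point, sort its distances and take the k-th nearest neighbor
# (position k + 1, because position 1 is the zero distance to itself).
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted ascending, these values form the curve whose knee suggests eps.
print(round(quantile(sort(knn_dist), c(0.5, 0.8, 0.9, 1)), 3))
```

Points whose 4-NN distance lies well above the knee sit in sparse regions and are the ones DBSCAN is likely to label as noise.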

dbscan_result<-dbscan(scaled_data, eps=0.43, minPts = 5)
query$dbscan_cluster<-as.factor(dbscan_result$cluster)
ggplot(query, aes(x= mag, y= depth, color=dbscan_cluster))+geom_point(size=2, alpha=0.8)+scale_color_manual(values= c("0"="darkgray", "1"="pink", "2"="purple", "3"= "lightblue"))

print(dbscan_result)
## DBSCAN clustering for 76 objects.
## Parameters: eps = 0.43, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 15 noise points.
## 
##  0  1  2  3 
## 15 44 12  5 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints

Check whether the noise ratio is below 20%.

noise_ratio<-summary(query$dbscan_cluster)/nrow(query)
noise_ratio
##          0          1          2          3 
## 0.19736842 0.57894737 0.15789474 0.06578947
## Ratio distribution
## Cluster 0: 19.7% (noise)
## Cluster 1: 57.9%
## Cluster 2: 15.8%
## Cluster 3: 6.6%

The cluster distribution appears reasonable: most of the data fall in the primary cluster (57.9%). The noise ratio is moderate (19.7%) and just under the 20% threshold, and the noise points likely correspond to sparse regions or extreme earthquake events.
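
The shares above follow directly from the cluster sizes printed by `print(dbscan_result)`. A minimal sketch of the arithmetic, with the sizes copied from that output (the `sizes` and `shares` names are introduced here for illustration):

```r
# Cluster sizes copied from the print(dbscan_result) output above;
# label 0 is DBSCAN's noise label.
sizes <- c("0" = 15, "1" = 44, "2" = 12, "3" = 5)

shares <- sizes / sum(sizes)
noise_ratio <- unname(shares["0"])

print(round(shares, 3))

# TRUE when the noise share stays below the 20% threshold.
print(noise_ratio < 0.20)
```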

Check the summary statistics of each cluster.

aggregate(query[, c("depth", "mag")], by = list(Cluster =
query$dbscan_cluster), FUN = function(x) c(mean = mean(x), sd = sd(x)))
##   Cluster depth.mean  depth.sd   mag.mean     mag.sd
## 1       0  21.200333 12.385837 5.34666667 0.74341554
## 2       1  30.580455  5.848346 4.73181818 0.20089651
## 3       2  17.212667  3.393173 4.51666667 0.03892495
## 4       3  10.141200  2.316420 5.16000000 0.08944272

Comparing the table with the plot, the noise points (label 0) include the main shock and other events that fall outside the dense regions, which is reflected in their large standard deviations (12.4 km in depth and 0.74 in magnitude). Cluster 1 represents the deepest earthquakes (mean depth 30.6 km) with magnitudes concentrated around 4.7. Cluster 2 corresponds to intermediate-depth earthquakes (mean depth 17.2 km) with magnitudes near 4.5. Cluster 3 consists of shallow earthquakes (mean depth 10.1 km) with relatively higher magnitudes of about 5.2.

3D Plot

Create a 3D plot to visualize the relationship between Magnitude, Depth, and Cluster assignments, providing a clear representation of how the data points are distributed across the three dimensions.

data<-tibble(magnitude= query$mag, depth = query$depth, cluster = query$dbscan_cluster)

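plot_ly(data, 
        x = ~magnitude, 
        y = ~depth, 
        z = ~cluster, 
        color = ~as.factor(cluster), 
        # plot_ly() takes the vector of colors through `colors`
        colors = as.character(wes_palette("GrandBudapest2", 4))) %>%
  # a fixed marker size is set through `marker`
  add_markers(marker = list(size = 4)) %>%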
  layout(
    title = "3D Scatter Plot of Depth and Magnitude",
    scene = list(
      xaxis = list(title = "Magnitude"),
      yaxis = list(title = "Depth"),
      zaxis = list(title = "Cluster")
    )
  )

Conclusion

Taiwan is located in the collision zone between the Eurasian Plate and the Philippine Sea Plate, which results in frequent earthquakes. These earthquakes vary in magnitude and depth, but they often cause serious and unpredictable damage to infrastructure and lives.

For this analysis, I used both K-means and DBSCAN clustering to investigate the seismic activity over a six-day period (April 2 to April 7). K-means provided insight into the general regional clustering, while DBSCAN was effective in capturing denser seismic clusters and detecting outliers.

This analysis shows how these earthquakes were distributed in Taiwan and offers insights that could be useful for future research. It may also help improve earthquake preparedness and mitigation strategies in high-risk areas.

Sources:

https://medium.com/saralkarki/earthquake-cluster-analysis-k-means-approach-cdb2bf6cb21b
https://www.nature.com/articles/s41467-020-17841-x