The Interplay of Magnitude and Plate
Proximity: The study of 2023 global earthquake trends
Brynhildur Traustadóttir
2024-04-30
Abstract
When the tectonic plate boundaries were first recognised, they were
identified by several factors, one of which was the pattern observed in
seismic activity data. Such data marks the regions where tectonic forces
converge, and it is widely known that seismic events are associated with
all tectonic plate boundaries. In this study, I explore the relationship
between the magnitude of a seismic event and its proximity to a plate
boundary. Using a dataset of seismic events from 2023 and a dataset of
geographic coordinates outlining all tectonic plates, I investigate how
seismic magnitude relates to the distance from the nearest plate
boundary. Starting from over 26,000 recorded seismic events, I apply
several data preprocessing methods, assemble visualizations to observe
the behavior of the data, and identify patterns and relationships. I
then compare different predictive models built on those relationships.
Using Random Forest, a decision tree-based model, the results indicate
that seismic magnitude can be predicted from plate proximity when the
maximum distance within each magnitude set is used, an approach
motivated by extreme value theory. The study provides initial insight
into the relationship between seismic magnitude and plate proximity.
Introduction
Being born and raised in Iceland, “the land of fire and ice”, I have
experienced many seismic events, which sparked my interest in the
science of seismology. Iceland sits on the boundary between two tectonic
plates, the Eurasian Plate and the North American Plate, and experiences
frequent seismic activity as a result. When the plate boundaries were
first discovered and drawn on a map, scientists drew them along areas of
high seismicity, so it is widely known that seismic activity is highly
correlated with the plate boundaries (Wilson, 2021). There are three
types of plate boundary. The first is a “convergent” boundary, where the
crusts come together and collide; in some cases “subduction” occurs,
where one tectonic plate dives underneath the other. The second is a
“divergent” boundary, where the two plates spread apart and often form a
ridge. The third is a “transform” boundary, where the two crusts slide
past one another horizontally (National Geographic Society, 2024).
In November 2023, the town of Grindavík on the Reykjanes Peninsula of
Iceland experienced numerous seismic events. During the week of
November 7th to 14th, approximately 13,700 seismic events were recorded,
over 1,000 of them significant. This activity is what inspired me to
explore the topic further. Figure 1 highlights the extent of the damage
and shows the fissure that emerged through the town. The town has
remained evacuated since (“Vikuleg Jarðskjálftayfirlit” n.d.).
Seismology became its own science in the late nineteenth century, and
since then there has been a lot of research on the relationship between
the depth of seismic events and plate boundaries (Agnew, 1989). Most
events occur on plate boundaries, but how deep an event is depends on
the type of boundary. For example, according to Wilson (2021),
“Transform boundaries tend to have steeper rupture surfaces, which cause
the earthquakes to occur in a narrow zone and limits their size”. As
mentioned earlier, subduction occurs when one crust goes beneath
another. Seismic activity at subduction zones is abundant and reaches
greater depths than at other plate boundaries (Wilson, 2021). There is
also a relationship between the intensity of a seismic event at the
surface and its depth. With a deeper event, the distance from the source
to the surface of the crust increases and the strength of shaking
diminishes. Consequently, two events of equal magnitude but different
depths produce different levels of shaking at the surface (“At what
depth do earthquakes occur? What is the significance of the depth?”,
n.d).
While it has been established that the majority of seismic events lie
along the plate boundaries, there remains a subset of events that occur
far away from them. Shallow events can have a greater impact; however,
the question remains: do seismic events that occur further away from
plate boundaries also exhibit high magnitudes? In this study, I delve
into a seismic event dataset from the year 2023 to understand the
relationship between seismic magnitude and proximity to plate
boundaries.
Methods
Data Preprocessing
I acquired a dataset documenting global seismic events throughout the
year 2023 from Kaggle.com. The dataset includes over 22 attributes,
providing information on more than 26,000 recorded seismic events. Key
attributes include magnitude, time, depth and geographic coordinates of
each event. However, as the dataset lacked information about tectonic
plates, I obtained an additional dataset containing the geographic
coordinates of all tectonic plates, major and minor, along with their
respective names (Keser, 2023; Thompson, 2020). An illustration of the
earth’s lithosphere with the marking of all plate boundaries is shown in
figure 2.
Figure 2: Map illustrating all major and minor
tectonic plate boundaries.
Most of the data preprocessing was done in Python; details of the code
can be found in the appendix. Initially, the dataset contained over
26,000 entries. However, to ensure the accuracy of the data, entries
with missing values were removed, resulting in approximately 22,000
valid observations.
To investigate the proximity of seismic events to tectonic plates, a
new attribute named “distance” was created, calculated using the
Haversine Formula. This mathematical formula is used to calculate the
shortest distance between two points on the surface of a sphere in
kilometers:
\[
a = \sin^2\left(\frac{\Delta lat}{2}\right) + \cos(lat_1) \cdot \cos(lat_2) \cdot \sin^2\left(\frac{\Delta lon}{2}\right)
\]
\[
c = 2 \cdot \text{atan2}\left(\sqrt{a}, \sqrt{1-a}\right)
\]
\[
d = R \cdot c
\]
Where \(d\) is the distance between
the two points (along the surface of the sphere), \(\Delta lat\) is the difference in latitude
between the two points, \(\Delta lon\)
is the difference in longitude between the two points, \(R\) is the radius of the sphere (6,371
kilometers), \(lat_1\) and \(lat_2\) are the latitudes of the two points
in radians, and \(\text{atan2}\) is the
two-argument arctangent function (Upadhyay, 2024).
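As a minimal sketch of how the "distance" attribute could be computed, the Python snippet below implements the Haversine formula and takes the smallest distance from one event to a list of boundary points. The function names, the clamp on `a`, and the sample coordinates are assumptions made for illustration; the code actually used for this study is in the appendix and may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # sphere radius R quoted in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    a = min(1.0, a)  # guard against rounding slightly above 1 for antipodal points
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c


def nearest_boundary_km(event, boundary_points):
    """Smallest haversine distance from one event to any boundary point."""
    lat, lon = event
    return min(haversine_km(lat, lon, b_lat, b_lon) for b_lat, b_lon in boundary_points)


# Hypothetical coordinates, for illustration only (not taken from the datasets).
event = (63.84, -22.43)
boundary_points = [(63.0, -24.0), (64.0, -21.0), (66.0, -18.0)]
nearest = nearest_boundary_km(event, boundary_points)
```

Because the boundary is represented by discrete coordinates, a distance computed this way is the distance to the nearest listed boundary point, which can slightly overestimate the distance to the boundary line itself.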
The rest of the preprocessing steps were performed in RStudio. These
steps involve removing unnecessary attributes and adjusting data types
to simplify further research. The table below outlines the attributes of
the data that will be utilized in this study.
# Removing unnecessary columns (id & updated)
Earthquakes <- earthquake_plates_df[, c("time", "latitude", "longitude", "depth", "mag", "type",
                                        "area", "distance", "plate_latitude", "plate_longitude",
                                        "plate_name")]

# Making a dataframe with only earthquakes (26,428)
# only including magnitude of 4.5 and above
df = Earthquakes %>%
  filter(type == "earthquake") %>%
  filter(mag > 4.499)

# Converting column types
df$time <- as.POSIXct(df$time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
df$mag = as.numeric(df$mag)
df$type = as.factor(df$type)
Exploratory Data Analysis
Before diving into complex analysis, I conducted Exploratory Data
Analysis (EDA) to understand the structure and distribution of the
dataset. EDA serves as a crucial step in understanding the underlying
patterns and relationships within the data.
Magnitude Distribution
The distribution of seismic magnitudes is represented in figure
Magnitude Distribution through an area plot. Upon examination I
observed an irregular and inconsistent pattern up to the value of 4.5.
This observation led me to believe that the reliability of these data
points was questionable. Consequently, all data points below a
magnitude of 4.5 have been excluded from the dataset. The analysis
therefore focuses on the more significant seismic events, which makes
the results more robust. Approximately 8,600 observations remain.
The table below outlines the summary statistics of the key
attributes. As previously mentioned, the minimum magnitude recorded is
4.5. Interestingly, both the mean and the median magnitude are close to
that minimum. Similarly, for the depth and the distance, the median lies
far closer to the minimum than to the maximum, and the mean exceeds the
median. These summaries indicate that a large portion of the data points
are clustered near the minimum values, with a long tail of larger
values.
##      depth              mag           distance
##  Min.   :  1.873   Min.   :4.50   Min.   :   0.3696
##  1st Qu.: 10.000   1st Qu.:4.50   1st Qu.:  32.9213
##  Median : 16.000   Median :4.60   Median :  62.8752
##  Mean   : 64.055   Mean   :4.75   Mean   : 119.1783
##  3rd Qu.: 61.881   3rd Qu.:4.90   3rd Qu.: 131.5074
##  Max.   :653.516   Max.   :7.80   Max.   :2310.7683
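To make the reading of such a summary concrete, here is a small Python sketch that computes the same six statistics for a list of values. The helper name `five_number_summary` and the toy distances are hypothetical and are not taken from the dataset; quartile conventions differ between tools, so the quartiles from this sketch need not match another package exactly.

```python
import statistics


def five_number_summary(values):
    """Min, quartiles, mean and max of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


# Toy distances in kilometers, invented for illustration only.
distances = [5, 20, 33, 50, 63, 90, 130, 400, 2300]
summary = five_number_summary(distances)
```

In this toy sample a single very large value pulls the mean well above the median, which is the same signature of right skew seen in the depth and distance columns.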
Depth Distribution Map
Visualizations play an important role in EDA, allowing us to identify
trends, outliers and potential relationships among our attributes. Figure
Depth Distribution Map displays the seismic events distributed
on a map, with each point representing an event. The size and color
indicate the depth, highlighting the deep events in Southeast Asia. This
region lies where the Eurasian plate converges with the Philippine Sea
and Indo-Australian plates, forming convergent plate boundaries prone to
subduction. As previously mentioned, subduction often results in greater
depths compared to other boundary types. This visualization thereby
supports that observation.
Frequency of Earthquakes
Figure Frequency of Earthquakes showcases a dashboard
featuring multiple visualizations that increase our understanding
of the data. Figure a exhibits the same area plot shown in figure 3,
though with the data points below 4.5 excluded. Over 3,000 recorded
observations register a magnitude of 4.5, representing 41% of our data
and demonstrating a right skew in the distribution. As the magnitude
increases, the frequency sharply declines, supporting our earlier
statement that a significant portion of the data is clustered closely
around the minimum values.
Figure b illustrates a density map representing the activity around
the globe. Meanwhile, figure c illustrates a time series plot showcasing
the distribution of seismic activity throughout the year 2023. The
time series plot displays a generally uniform distribution, marked by
three noticeable spikes in activity. The tallest spike occurs on
December 2nd and 3rd, with over 350 seismic events recorded in the
Philippines alone. The heightened activity is noticeable on the map,
particularly in Southeast Asia.
Distance by Area
Figures Distance by Area and Plate Proximity and Magnitude give
detailed information about the “distance” attribute, with the aim of
better understanding its distribution and relationships. The regions
furthest away from the plate boundaries are showcased in a bar plot in
figure Distance by Area, with each bar labeled with the count of events
occurring within that region. The map shows the seismic events with
sizes corresponding to the distance. The seismic event that occurred
furthest from a plate boundary is located in the French Polynesia
region, which lies on the Pacific Plate.
Plate Proximity and Magnitude
Illustrated in figure a, the bar plot represents the distribution of
the distance in bins of 100 kilometers. The first bar represents over
5,000 data points, which is approximately 65% of our total data points.
As previously discussed, the majority of seismic activity lies on or
close to plate boundaries, so this should not come as a surprise. As the
distance increases, the frequency decreases dramatically. On closer
inspection, up to 20 observations are more than 1,500 kilometers from
the nearest plate boundary.
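The binning behind a bar plot like this can be sketched in a few lines of Python. The function name `bin_counts` and the sample distances are hypothetical and only illustrate the technique of grouping distances into 100-kilometer bins; the published figure was produced with a separate visualization tool.

```python
from collections import Counter


def bin_counts(distances_km, width_km=100):
    """Count observations per bin, keyed by the bin's lower edge in kilometers."""
    counts = Counter(int(d // width_km) * width_km for d in distances_km)
    return dict(sorted(counts.items()))


# Toy distances in kilometers, invented for illustration only.
distances_km = [0.4, 32.9, 62.9, 99.9, 100.0, 131.5, 250.0, 1520.3, 2310.8]
counts = bin_counts(distances_km)

# Share of observations that fall in the first bin (0 to 100 km).
share_first_bin = counts[0] / len(distances_km)
```

Each key is the lower edge of a bin, so `counts[0]` covers distances from 0 up to, but not including, 100 kilometers.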
Figure b, a key visualization, showcases the relationship between
seismic magnitude and plate proximity, also referred to as the distance.
Initially the plot suggests a negative linear relationship between the
two attributes, indicating that as the distance from the plate boundary
increases the magnitude decreases. However, upon closer inspection, it
becomes evident that the majority of the data points cluster within the
range of 4.5 to 5.0 magnitude and 0 to 100 kilometers in distance.
Consequently, fitting a simple linear regression to the individual data
points leads to insignificant results. Nonetheless, as initially
suggested, the highest distance values for each magnitude set, in
increments of 0.1, show a negative linear relationship. With this
observation in mind, I turned to extreme value theory, which provides a
mathematical framework for dealing with extreme events and their
probabilities. Accordingly, an extra attribute has been added to the
dataset that holds the maximum distance for each set of magnitude, in
increments of 0.1.
# Create a column magFactor, increments of 0.1
breaks <- seq(4.5, 8.0, by = 0.1)
# Label each bin by its lower edge (every break except the last)
labels <- paste0(head(breaks, -1))
df$magFactor <- cut(df$mag, breaks = breaks, labels = labels, right = FALSE)

# using extreme value theorem
# Aggregating highest minPlateDistance for each magFactor level
max_distance <- df %>%
  group_by(magFactor) %>%
  summarise(max_distance = max(distance))

# Merging aggregated data back into original dataframe
df <- left_join(df, max_distance, by = "magFactor")
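The same aggregation can be sketched in Python for readers who prefer it: group events into 0.1-wide magnitude sets, take the largest distance in each set, and attach that maximum back to every event. The helper names and the sample events are hypothetical, and the binning assumes magnitudes are recorded to at most two decimals; it is an illustration of the idea, not the code used in the study.

```python
from collections import defaultdict


def magnitude_bin(mag):
    """Lower edge of the 0.1-wide set containing mag, e.g. 4.67 -> 4.6."""
    hundredths = round(mag * 100)  # integer arithmetic avoids float edge cases
    return (hundredths // 10) / 10


def max_distance_by_bin(events):
    """Map each magnitude set to the largest distance observed in it."""
    maxima = defaultdict(float)  # distances are non-negative, so 0.0 is a safe start
    for mag, distance_km in events:
        b = magnitude_bin(mag)
        maxima[b] = max(maxima[b], distance_km)
    return dict(maxima)


# Toy (magnitude, distance in km) pairs, invented for illustration only.
events = [(4.5, 12.0), (4.5, 980.0), (4.6, 300.0), (4.67, 450.0), (5.2, 80.0), (7.8, 35.0)]
maxima = max_distance_by_bin(events)

# Attach the per-set maximum to every event, mirroring the join step.
enriched = [(mag, dist, maxima[magnitude_bin(mag)]) for mag, dist in events]
```

After this step every event in the same magnitude set carries the same maximum distance, so the new attribute has one value per set rather than one per event.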
The plot below showcases the relationship between the maximum
distance and the seismic magnitude.
ggplot(df, aes(x = magFactor, y = max_distance)) +
  geom_point(color = "#f0a0d1", size = 3, alpha = 0.6) +
  labs(x = "Magnitude", y = "Maximum Distance") +
  ggtitle("Maximum distance by Sets of Magnitude") +
  theme_ipsum() +
  theme(plot.title = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.position = "none")
Frequency by Country
The figure Frequency by Country illustrates the frequency of
seismic events for each country.
Modeling
To answer the research question of whether seismic events occurring
further from plate boundaries also exhibit high magnitudes, I conducted
and compared various data modeling methods. To ensure an unbiased
evaluation of the models' performance, the dataset was divided into a
training set and a testing set. The training set contains 80% of the
original data and the testing set contains the remaining 20%. I explored
four distinct modeling techniques, evaluated with cross-validation, to
analyze the relationship between seismic magnitude and maximum distance
from a plate boundary. The corresponding code can be found in the
appendix.
Random Forest: Random Forest is an ensemble tree-based method
that constructs multiple decision trees and outputs the mean prediction
of the individual trees (James, Witten, Hastie, & Tibshirani,
2023).
Gradient Boosting Machine (GBM): GBM is a boosting algorithm that
is also a tree-based method. It builds multiple decision trees with each
tree trying to correct the error from the previous one (James, Witten,
Hastie, & Tibshirani, 2023).
Recursive Partitioning Tree (RPT): RPT is also a tree-based
method. It is a non-linear predictive model that divides the dataset
into subsets based on the values of the input variables (Izenman,
2013).
Linear Regression: A supervised learning model that is used to
model the relationship between a dependent variable and an independent
variable (James, Witten, Hastie, & Tibshirani, 2023).
For all modeling techniques, the magnitude was the dependent variable
and the maximum distance was the independent variable and they were
trained on the training set.
# Splitting the data into training and testing
library(caret)
set.seed(123)
train_index <- createDataPartition(df$mag, p = 0.8, list = FALSE, times = 1)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Training set with a repeated cross validation
train_ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
set.seed(2018)

# Recursive Partitioning Tree with only distance as an attribute
rpart_tree_dist <- train(mag ~ max_distance,
                         data = train_data,
                         method = "rpart",
                         trControl = train_ctrl)

# Random Forest Tree with only Distance as an attribute
rf_tree_dist <- train(mag ~ max_distance,
                      data = train_data,
                      method = "rf",
                      trControl = train_ctrl)

# Gradient Boosting Machine with only distance as an attribute
gbm_tree_dist <- train(mag ~ max_distance,
                       data = train_data,
                       method = "gbm",
                       trControl = train_ctrl,
                       distribution = "gaussian",
                       verbose = FALSE)

# Linear Regression with only distance as an attribute
lm_tree_dist <- train(mag ~ max_distance,
                      data = train_data,
                      method = "lm",
                      trControl = train_ctrl)

# Collecting the cross-validation results of all four models
resamp <- resamples(list(rpart_tree = rpart_tree_dist,
                         randomForest = rf_tree_dist,
                         GBM = gbm_tree_dist,
                         LM = lm_tree_dist))
summary(resamp)
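For readers unfamiliar with the resampling scheme, the following Python sketch shows what an 80/20 split followed by 5-fold cross-validation repeated 3 times looks like at the level of row indices. The function names are hypothetical, and the split here is a plain random shuffle; caret's `createDataPartition` additionally balances the split on the outcome, which this sketch does not attempt.

```python
import random


def train_test_split(n_rows, train_fraction=0.8, seed=123):
    """Randomly assign row indices to a training and a testing set."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])


def repeated_kfold(indices, k=5, repeats=3, seed=2018):
    """Yield (fit, held_out) index lists for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k near-equal folds
        for held_out in folds:
            held = set(held_out)
            fit = [i for i in shuffled if i not in held]
            yield fit, held_out


# 100 rows stand in for the dataset; the real data has far more rows.
train_idx, test_idx = train_test_split(100)
splits = list(repeated_kfold(train_idx))
```

Each model is fitted once per split on the `fit` rows and scored on the `held_out` rows, giving 15 scores per metric; the testing set is never touched during this stage.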
The visualization below illustrates the results from the four trained
models described in the modeling section. The figure shows box plots of
three metrics. The first box plot represents the Mean Absolute Error
(MAE) of the models. The MAE measures the average absolute difference
between the predicted value and the actual value. A lower MAE indicates
a better model. The second box plot represents the Root Mean Squared
Error (RMSE). The RMSE is the square root of the average squared
difference between the predicted value and the actual value. It is
similar to MAE, with a lower value indicating a better model; however,
because the errors are squared before averaging, the RMSE is more
sensitive to larger errors. The third box plot represents the \(R^2\) values. The \(R^2\) measures how much of the variation in the
dependent variable is explained by the independent variable. The value
ranges from 0 to 1, where the closer the value is to 1, the better the
independent variable explains the dependent variable.
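The three metrics can be written out directly, as in the Python sketch below. The function names and the sample values are hypothetical. The \(R^2\) here uses the definition of one minus the ratio of residual to total sum of squares, which is an assumption: some tools report the squared correlation between predictions and observations instead, and the two can differ (this definition can fall below 0 for a very poor model).

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, the share of variance explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy magnitudes, invented for illustration only.
actual = [4.5, 4.6, 4.9, 5.4, 6.1]
predicted = [4.6, 4.6, 4.8, 5.2, 6.6]

scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```

In this toy example the single large error of 0.5 raises the RMSE noticeably above the MAE, which illustrates the sensitivity to larger errors described above.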
# Extract values from resamp object
resamp_values <- resamp$values

# Reshape the data to long format
resamp_long <- pivot_longer(resamp_values, cols = -Resample,
                            names_to = "Model_Metric", values_to = "Value")

# Extract models and metrics from the Model_Metric column
resamp_long <- separate(resamp_long, col = Model_Metric,
                        into = c("Model", "Metric"), sep = "~")
resamp_long$Model <- factor(resamp_long$Model,
                            levels = c("randomForest", "rpart_tree", "GBM", "LM"),
                            labels = c("Random Forest", "RPT", "GBM", "Linear Regression"))

custom_palette <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7")

# Creating a theme
custom_theme <- theme_ipsum(base_size = 12) +
  theme(panel.grid.major.y = element_line(color = "grey80"),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(data = resamp_long, aes(x = Model, y = Value, fill = Model)) +
  geom_boxplot(outlier.shape = NA, width = 0.6) +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_manual(values = custom_palette) +
  labs(title = "Boxplot of Metrics by Model",
       x = "Model",
       y = "Value",
       fill = "Model") +
  custom_theme
As previously discussed, lower MAE and RMSE values, along with a high
\(R^2\) value, indicate better model performance. Therefore, the
objective is to identify the model that shows the best results. Upon
initial inspection, the Random Forest model outperformed the other
models across all three metrics. The MAE and RMSE values suggest that
the predictions were generally accurate, with little to no variation.
The \(R^2\) value of 0.9996 shows that the
model explained nearly all of the variance in the data. Overall, these
results indicate that the model captured the patterns in the data and
that it is highly accurate and reliable.
# Making predictions using randomForest model
predictions_rf <- predict(rf_tree_dist, newdata = test_data)
cat("Random Forest Prediction:", mean(predictions_rf), "\n")
## Random Forest Prediction: 4.755023
The trained random forest model was used to make predictions on the
testing dataset; the mean of the predicted seismic magnitudes was 4.755.
This illustrates the model’s ability to estimate the magnitude based on
the maximum plate proximity. Notably, the mean magnitude in the dataset
is 4.75, which indicates that the predictions, on average, align closely
with the average magnitude observed in the dataset.
Conclusion
The science of seismology dates back to the late nineteenth century,
and scientists have since gained valuable insights into the correlation
between seismic activity and plate tectonics. Over the years, they have
linked the depth of a seismic event with its seismic magnitude, and
described the different effects and characteristics of each boundary
type. In this study, I investigated the impact of plate proximity on
seismic magnitude using a dataset of approximately 8,600 recorded
seismic events from the year 2023. With the help of the Haversine
formula, I created an additional attribute representing the distance
between each seismic event and the nearest tectonic plate boundary.
By combining various visualizations, I explored the behavior of, and
relationship between, seismic magnitude and plate proximity. By applying
ideas from extreme value theory, I uncovered a pattern indicating that
the maximum distance from a plate boundary decreases as seismic
magnitude increases within the dataset. Utilizing this pattern, I
trained a Random Forest model to predict the magnitude of a seismic
event from the maximum distance from a tectonic plate boundary. Finally,
I used the trained model to make predictions on the held-out testing
part of the dataset, with accurate and reliable results.
References
Izenman, A. J. (2013). Recursive Partitioning and Tree-Based Methods.
In: Modern Multivariate Statistical Techniques. Springer Texts in
Statistics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-78189-1_9
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An
Introduction to Statistical Learning: with Applications in R (2nd
ed.).