## 'data.frame': 1000 obs. of 19 variables:
## $ title : chr "M 6.5 - 42 km W of Sola, Vanuatu" "M 6.5 - 43 km S of Intipucá, El Salvador" "M 6.6 - 25 km ESE of Loncopué, Argentina" "M 7.2 - 98 km S of Sand Point, Alaska" ...
## $ magnitude: num 6.5 6.5 6.6 7.2 7.3 6.6 6.9 7.2 6.6 7.1 ...
## $ date_time: chr "16-08-2023 12:47" "19-07-2023 00:22" "17-07-2023 03:05" "16-07-2023 06:48" ...
## $ cdi : int 7 8 7 6 0 5 4 8 6 3 ...
## $ mmi : int 4 6 5 6 5 4 4 6 6 4 ...
## $ alert : chr "green" "yellow" "green" "green" ...
## $ tsunami : int 0 0 0 1 1 1 1 1 1 1 ...
## $ sig : int 657 775 899 860 820 802 741 804 733 777 ...
## $ net : chr "us" "us" "us" "us" ...
## $ nst : int 114 92 70 173 79 95 136 85 50 98 ...
## $ dmin : num 7.177 0.679 1.634 0.907 0.879 ...
## $ gap : num 25 40 28 36 173 ...
## $ magType : chr "mww" "mww" "mww" "mww" ...
## $ depth : num 193 69.7 171.4 32.6 21 ...
## $ latitude : num -13.9 12.8 -38.2 54.4 54.5 ...
## $ longitude: num 167.2 -88.1 -70.4 -160.7 -160.8 ...
## $ location : chr "Sola, Vanuatu" "Intipucá, El Salvador" "Loncopué, Argentina" "Sand Point, Alaska" ...
## $ continent: chr "" "" "South America" "" ...
## $ country : chr "Vanuatu" "" "Argentina" "" ...
# Converting tsunami to a factor variable
earthquake_df$tsunami = as.factor(earthquake_df$tsunami)
# Converting alert to a factor variable
earthquake_df$alert = as.factor(earthquake_df$alert)
# Converting magType to a factor variable
earthquake_df$magType = as.factor(earthquake_df$magType)
# Splitting time_date into two seperate columns called date
earthquake_df$time <- sapply(strsplit(as.character(earthquake_df$date_time), " "), "[[", 2)
earthquake_df$date <- sapply(strsplit(as.character(earthquake_df$date_time), " "), "[[", 1)
# Converting the date column into a date variable
earthquake_df$date = as.Date(earthquake_df$date, format = "%d-%m-%Y")
# removing the time_date variable
# removing location due to having longitude and latitude
# removing continent due to having too many null variables
# removing title due to being unnecessary
earthquake_df = earthquake_df[,c(-1, -3, -17, -18)]
str(earthquake_df)## 'data.frame': 1000 obs. of 17 variables:
## $ magnitude: num 6.5 6.5 6.6 7.2 7.3 6.6 6.9 7.2 6.6 7.1 ...
## $ cdi : int 7 8 7 6 0 5 4 8 6 3 ...
## $ mmi : int 4 6 5 6 5 4 4 6 6 4 ...
## $ alert : Factor w/ 5 levels "","green","orange",..: 2 5 2 2 1 2 2 2 2 2 ...
## $ tsunami : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
## $ sig : int 657 775 899 860 820 802 741 804 733 777 ...
## $ net : chr "us" "us" "us" "us" ...
## $ nst : int 114 92 70 173 79 95 136 85 50 98 ...
## $ dmin : num 7.177 0.679 1.634 0.907 0.879 ...
## $ gap : num 25 40 28 36 173 ...
## $ magType : Factor w/ 9 levels "mb","md","Mi",..: 9 9 9 9 3 9 9 9 9 9 ...
## $ depth : num 193 69.7 171.4 32.6 21 ...
## $ latitude : num -13.9 12.8 -38.2 54.4 54.5 ...
## $ longitude: num 167.2 -88.1 -70.4 -160.7 -160.8 ...
## $ country : chr "Vanuatu" "" "Argentina" "" ...
## $ time : chr "12:47" "00:22" "03:05" "06:48" ...
## $ date : Date, format: "2023-08-16" "2023-07-19" ...
Des <- data.frame(
Column = c("Magnitude","CDI", "MMI", "Alert", "Tsunami", "Sig", "Net", "Nst", "Dmin", "Gap",
"MagType", "Depth", "Latitude/Longitude", "Country", "Time", "Date"),
Type = c("Num","Int", "Int", "Factor", "Factor", "Int", "Chr", "Int", "Num", "Num",
"Factor", "Num", "Num", "Chr", "Chr", "Date"),
Description = c(
"The Magnitude in Richter Scale of the Event",
"Community Decimal Intensity) Maximum reported intensity for the event range",
"(Modified Mercalli Intensity) Maximum estimated instrumental intensity for the event",
"Alert level, Green - Yellow - Orange - Red",
"'1' for events in oceanic area, '0' otherwise",
"Significance of event, calculated from the magnitude, maximum mmi, felt reports and estimated impact",
"ID of a data contributer",
"Total number of seismic stations used to determine events location",
"Horizontal distance from the epicenter to the nearest station",
"The largest azimuthal gap between azimuthally adjacent stations",
"The method/algorithm used to calculate the magnitude of the event",
"The depth of the start of the earthquake",
"Coordinate system of the location of the event",
"Affected country",
"Time of the event",
"Date of the event"))
datatable(Des, rownames = FALSE, caption = 'Column Description', options = list(theme = 'bootstrap'))In the further research, magnitude and depth will be used as the response variables
Before any modelling, the data is split into train and test data with the ratio of 1:4
# Summary statistics
summary(earthquake_df[, c("magnitude", "cdi", "mmi", "sig", "nst", "dmin", "gap", "depth")])## magnitude cdi mmi sig
## Min. :6.50 Min. :0.000 Min. : 1.000 Min. : 650.0
## 1st Qu.:6.60 1st Qu.:0.000 1st Qu.: 5.000 1st Qu.: 691.0
## Median :6.80 Median :4.000 Median : 6.000 Median : 744.0
## Mean :6.94 Mean :3.605 Mean : 6.027 Mean : 847.9
## 3rd Qu.:7.10 3rd Qu.:7.000 3rd Qu.: 7.000 3rd Qu.: 874.2
## Max. :9.10 Max. :9.000 Max. :10.000 Max. :2910.0
## nst dmin gap depth
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 2.70
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 16.00
## Median : 0.0 Median : 0.000 Median : 18.00 Median : 29.00
## Mean :193.9 Mean : 1.125 Mean : 20.93 Mean : 74.61
## 3rd Qu.:403.0 3rd Qu.: 1.549 3rd Qu.: 27.00 3rd Qu.: 55.00
## Max. :934.0 Max. :17.654 Max. :239.00 Max. :670.81
# Distribution of Magnitude
ggplot(earthquake_df, aes(x = magnitude)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Earthquake Magnitudes", x = "Magnitude (Richter Scale)", y = "Frequency") +
theme_ipsum()# Frequency over time
ggplot(earthquake_df, aes(x = date, fill= 'skyblue')) +
geom_histogram(binwidth = 30, fill = "skyblue", color = "black") +
labs(title = "Earthquake Frequency Over Time",
x = "Date",
y = "Frequency") +
theme_ipsum()One interesting factor in this plot is the frequency of events every 5 years. It seems to increase a lot and then go back down. However, around the years between 2012 and 2018 they seem to increase by a severe amount. That leads to speculations on if there was something specific happening in the world those years that made the frequency exceed?
Since we have the latitude and longitude coordinates, we are able to mark the exact location of the earthquakes on a world map.
# Geographical distribution
# Making the map
my_map <- leaflet(data = earthquake_df) %>%
addTiles() %>%
setView(lng = mean(earthquake_df$longitude), lat = mean(earthquake_df$latitude), zoom = 1)
# Adding the earthquakes
my_map <- my_map %>%
addCircleMarkers(
lng = ~longitude,
lat = ~latitude,
radius = 5,
color = ~colorNumeric(palette = "RdYlBu", domain = earthquake_df$magnitude)(magnitude),
popup = ~as.character(magnitude)
)
my_map# Correlation matrix
cor_matrix <- cor(earthquake_df[, c("magnitude", "cdi", "mmi", "sig", "nst", "dmin", "gap", "depth", "latitude", "longitude")])
corrplot::corrplot(cor_matrix, method = "color")# Plot of linear relationship between Magnitude and Significance
ggplot(earthquake_df, aes(x = magnitude, y = sig, color = tsunami)) +
geom_point(alpha = 0.7, size = 3) +
geom_smooth(method = "lm", se = FALSE, color = 'red') +
scale_color_manual(values = c("0" = "skyblue", "1" = "pink"), name = "Tsunami") +
labs(
title = "The Linear Relationship between Magnitude and Significance",
x = "Magnitude (Richter Scale)",
y = "Significance"
) +
theme_ipsum()## `geom_smooth()` using formula = 'y ~ x'
earthquake_tsunami_1 <- subset(earthquake_df, tsunami == "1")
earthquake_tsunami_0 <- subset(earthquake_df, tsunami == "0")
tsunamiPlot <- ggplot(earthquake_tsunami_1, aes(x = magnitude, y = sig)) +
geom_point(alpha = 0.7, size = 3, color = 'pink') +
geom_smooth(method = "lm", se = FALSE, color = 'red') +
labs(
subtitle = "Scatter Plot of earthquakes in oceanic region",
x = "Magnitude",
y = "Significance"
) +
theme_ipsum()
NontsunamiPlot <- ggplot(earthquake_tsunami_0, aes(x = magnitude, y = sig)) +
geom_point(alpha = 0.7, size = 3, color = 'skyblue') +
geom_smooth(method = "lm", se = FALSE, color = 'blue') +
labs(
subtitle = "Scatter Plot of earthquakes on land",
x = "Magnitude",
y = "Significance"
) +
theme_ipsum()
plot <- ggarrange(NontsunamiPlot, tsunamiPlot,
common.legend = FALSE,
ncol = 2, nrow = 1, legend = "bottom")## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
annotate_figure(plot, top = text_grob("Magnitude and Significance",
color = "black", face = "bold", size = 16))##
## Call:
## lm(formula = magnitude ~ poly(sig, 5), data = earthquake_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00949 -0.07219 0.02281 0.07926 1.80233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.94599 0.01061 654.408 < 2e-16 ***
## poly(sig, 5)1 5.86109 0.29068 20.163 < 2e-16 ***
## poly(sig, 5)2 -4.66295 0.29068 -16.041 < 2e-16 ***
## poly(sig, 5)3 3.85688 0.29068 13.268 < 2e-16 ***
## poly(sig, 5)4 -2.88756 0.29068 -9.934 < 2e-16 ***
## poly(sig, 5)5 1.46511 0.29068 5.040 5.84e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2907 on 744 degrees of freedom
## Multiple R-squared: 0.5644, Adjusted R-squared: 0.5615
## F-statistic: 192.8 on 5 and 744 DF, p-value: < 2.2e-16
siglims <- range(sig)
sig.grid <- seq(from = siglims[1], to = siglims[2])
preds <- predict(poly_mag_fit, newdata = list(sig = sig.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit) # Standard error
#par(mfrow = c(1, 2), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 4, 0))
plot(1, type = 'n', xlim = siglims, ylim = range(magnitude), xlab = 'Sig', ylab = 'Magnitude')
points(sig, magnitude, cex = 0.5, col = 'darkgrey')
title('Degree-5 Polynomial', outer = FALSE)
lines(sig.grid, preds$fit, lwd = 2, col = 'blue')
matlines(sig.grid, se.bands, lwd = 1, col = 'blue', lty = 3)ggplot(earthquake_df, aes(x = factor(tsunami), fill = tsunami)) +
geom_bar() +
labs(title = "Tsunami Occurrences", x = "Tsunami", y = "Frequency") +
scale_x_discrete(labels = c("No", "Yes")) +
scale_fill_manual(values = c("lightblue", "lightcoral"),
name = "Eartquake",
labels = c("On Land", "In Oceanic Area")) +
theme_minimal()# training the model
logistic_model <- glm(tsunami ~ magnitude + depth + alert, data = earthquake_train, family = "binomial")
summary(logistic_model)##
## Call:
## glm(formula = tsunami ~ magnitude + depth + alert, family = "binomial",
## data = earthquake_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1205 -0.1655 -0.1360 0.7093 3.3330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.2708149 2.3084839 -4.016 5.92e-05 ***
## magnitude 0.6980946 0.3166913 2.204 0.0275 *
## depth -0.0015013 0.0007288 -2.060 0.0394 *
## alertgreen 5.7206859 0.4883374 11.715 < 2e-16 ***
## alertorange 4.7547915 0.6720129 7.075 1.49e-12 ***
## alertred 4.0431907 0.7617617 5.308 1.11e-07 ***
## alertyellow 5.0614129 0.5648841 8.960 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 941.81 on 749 degrees of freedom
## Residual deviance: 436.23 on 743 degrees of freedom
## AIC: 450.23
##
## Number of Fisher Scoring iterations: 7
After a variable of testing, we came to the conclusion that the best predictor variables are magnitude, depth and alert. They are all statistically significant with a p-value threshold of 0.05. Since alert is a categorical variable there are 5 different equation that come out of the model to predict a tsunami. Each one represents one of the 5 factor levels of the alert variable:
Green:
Yellow:
Orange:
Red:
Empty:
However to calculate the probability of a tsunami we fill the equation above into the following equation:
The equation indicates that the higher the magnitude of the earthquake, the more likely it is to be a tsunami.
Now it is time to test the model
# Testing the model
logistic_prob = predict(logistic_model, earthquake_test, type = 'response')
logistic_pred = rep('0', 250)
logistic_pred[logistic_prob>0.5] = '1'
# confusion matrix
table(logistic_pred, tsunami_test)## tsunami_test
## logistic_pred 0 1
## 0 132 1
## 1 34 83
# Testing the Matrix
accuracy = mean(logistic_pred==tsunami_test)
precision = 131 /(131 + 83)
# Printing the Results
cat("Accuracy: ", accuracy, "\n",
"Precision: ", precision, "\n")## Accuracy: 0.86
## Precision: 0.6121495
The above results show the confusion matrix of the logistic model. From the results the accuracy and precision are calculated.
Accuracy
Precision