First, we load the tidyverse library, which will help us transform and present our data:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Second, we load our Earthquakes Data Set into memory by reading the CSV file:
quakes <- read_delim("./quakes.csv")
## Rows: 18334 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): magType, net, id, place, type, status, locationSource, magSource
## dbl (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
## dttm (2): time, updated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
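The message above suggests specifying the column types explicitly. A minimal sketch of what that could look like is shown here; the three-row `csv_text` extract and the name `quakes_demo` are made up for illustration and only mimic a few of the real columns.

```r
library(readr)

# Hypothetical three-row extract standing in for quakes.csv
csv_text <- "time,depth,mag,magType
2023-01-01T00:00:00Z,10.0,5.1,mb
2023-01-02T00:00:00Z,35.5,6.2,mww
2023-01-03T00:00:00Z,600.1,5.4,Ml"

# I() tells readr to treat the string as literal data rather than a file path.
# Supplying col_types removes the guessing step and silences the message.
quakes_demo <- read_delim(
  I(csv_text),
  delim = ",",
  col_types = cols(
    time = col_datetime(),
    depth = col_double(),
    mag = col_double(),
    magType = col_character()
  )
)
quakes_demo
```

With the real file, the same `col_types` argument would be passed alongside the file path, listing whichever columns matter for the analysis.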
We can now get a summary of the data that we have using the code below:
summary(quakes[c("mag", "depth")])
## mag depth
## Min. :5.000 Min. : -1.01
## 1st Qu.:5.100 1st Qu.: 10.00
## Median :5.200 Median : 13.63
## Mean :5.341 Mean : 53.02
## 3rd Qu.:5.500 3rd Qu.: 45.53
## Max. :8.300 Max. :670.81
From the summary above we can note the following insights.
The maximum depth recorded for all earthquakes between 2013 and 2023 is 670.81 km and the minimum depth is -1.01 km. One concern here is whether the depth of an earthquake can be negative.
The mean depth is 53.02 km, well above the median of 13.63 km. This gap indicates that the depth distribution is right-skewed: most earthquakes are shallow, and a small number of very deep events pull the mean upward.
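One way to follow up on both observations is to count the negative depths and compare the mean with the median directly. The sketch below uses a small made-up `depth_demo` tibble rather than the real data set, so the numbers are purely illustrative.

```r
library(dplyr)
library(tibble)

# Illustrative depths in km (made-up values, not the real data set)
depth_demo <- tibble(depth = c(-1.01, 5, 10, 10, 13.6, 35, 45.5, 120, 670.8))

depth_check <- depth_demo |>
  summarise(
    n_negative   = sum(depth < 0),   # how many records have a negative depth
    mean_depth   = mean(depth),
    median_depth = median(depth)     # a mean far above the median suggests right skew
  )
depth_check
```

Running the same `summarise()` call on `quakes` would show how many records are affected before deciding whether to keep or exclude them.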
One thing missing from the summary above that would be relevant to us is the set of unique magnitude scale types used. We can obtain it as follows:
unique_mag_types <- quakes |>
select(magType) |>
distinct()
unique_mag_types
## # A tibble: 13 × 1
## magType
## <chr>
## 1 mb
## 2 mwb
## 3 mw
## 4 mww
## 5 mwr
## 6 mwc
## 7 ms
## 8 ml
## 9 Md
## 10 Ml
## 11 ms_20
## 12 mwp
## 13 Mi
We now have the types of magnitude scales used. However, we notice one problem: the same scale can appear under different capitalizations (for example, ml and Ml), so it is counted as two different scales, which is incorrect. We can modify the code above to handle case sensitivity:
unique_mag_types <- quakes |>
select(magType) |>
mutate(magType = tolower(magType)) |>
distinct()
unique_mag_types
## # A tibble: 12 × 1
## magType
## <chr>
## 1 mb
## 2 mwb
## 3 mw
## 4 mww
## 5 mwr
## 6 mwc
## 7 ms
## 8 ml
## 9 md
## 10 ms_20
## 11 mwp
## 12 mi
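Beyond listing the distinct scales, it can be useful to see how many records fall under each one after normalizing the case. This is a minimal sketch on a made-up `mag_demo` tibble; `count()` is used here in place of `distinct()` as an assumption about what would be most informative.

```r
library(dplyr)
library(tibble)

# Made-up sample of magnitude types, including mixed-case duplicates
mag_demo <- tibble(magType = c("mb", "ml", "Ml", "mww", "Mi", "mb"))

mag_counts <- mag_demo |>
  mutate(magType = tolower(magType)) |>  # merge "Ml" into "ml", "Mi" into "mi"
  count(magType, sort = TRUE)            # one row per scale, most frequent first
mag_counts
```

Note that this only changes the summary; to make later grouped analyses consistent, the lowercased column would need to be stored back in the data frame.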
Is there a correlation between the depth at which an earthquake originated and its magnitude? Here we want to investigate whether earthquakes originating deeper below the earth’s surface have more energy, or vice versa.
We use the cor() function to evaluate the correlation between depth and magnitude, as shown in the code snippet below:
quakes |>
drop_na(mag, depth, magType) |>
group_by(magType) |>
summarise(
Correlation = cor(mag, depth),
Count = n()
)
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `Correlation = cor(mag, depth)`.
## ℹ In group 6: `magType = "ms"`.
## Caused by warning in `cor()`:
## ! the standard deviation is zero
## # A tibble: 13 × 3
## magType Correlation Count
## <chr> <dbl> <int>
## 1 Md NA 1
## 2 Mi 1 2
## 3 Ml NA 1
## 4 mb 0.00838 8346
## 5 ml 0.143 55
## 6 ms NA 3
## 7 ms_20 -0.533 3
## 8 mw 0.0748 158
## 9 mwb 0.183 593
## 10 mwc -0.0647 228
## 11 mwp 0.139 6
## 12 mwr -0.157 231
## 13 mww 0.127 8707
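Correlations computed from only a handful of observations are unreliable (note the NA values above and the perfect correlation of 1 obtained from just two points), so we keep only the magnitude types with more than 100 observations:
quakes |>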
drop_na(mag, depth, magType) |>
group_by(magType) |>
filter(n()>100) |>
summarise(
Correlation = cor(mag, depth),
Count = n()
)
## # A tibble: 6 × 3
## magType Correlation Count
## <chr> <dbl> <int>
## 1 mb 0.00838 8346
## 2 mw 0.0748 158
## 3 mwb 0.183 593
## 4 mwc -0.0647 228
## 5 mwr -0.157 231
## 6 mww 0.127 8707
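The coefficients above are all small in absolute value (below 0.2), which suggests at most a weak linear relationship between depth and magnitude within each scale. As an alternative to filtering by group size, the sketch below wraps `cor()` in a hypothetical helper, `safe_cor()`, that returns NA when a group is too small or has no variation. The helper name, the `min_n` threshold, and the `cor_demo` data are all assumptions made for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: returns NA instead of warning when a correlation
# is not meaningful. Assumes x and y contain no missing values.
safe_cor <- function(x, y, min_n = 3) {
  if (length(x) < min_n || sd(x) == 0 || sd(y) == 0) {
    return(NA_real_)
  }
  cor(x, y)
}

# Made-up data: group "a" is perfectly linear, "b" has constant magnitude,
# and "c" has a single observation
cor_demo <- tibble(
  magType = c("a", "a", "a", "a", "b", "b", "b", "c"),
  mag     = c(5.0, 5.5, 6.0, 6.5, 5.2, 5.2, 5.2, 7.0),
  depth   = c(10, 20, 30, 40, 15, 30, 45, 100)
)

cor_by_type <- cor_demo |>
  group_by(magType) |>
  summarise(Correlation = safe_cor(mag, depth), Count = n())
cor_by_type
```

This keeps every group in the output, so the reader can still see which scales were too sparse to evaluate.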
Are there specific geographic regions that are more prone to strong earthquakes, or is the distribution of strong earthquakes just random?
# Filter for strong earthquakes
strong_quakes <- quakes |>
filter(mag >= 6.0)
# Plotting the geographic distribution of strong earthquakes
ggplot(strong_quakes, aes(x = longitude, y = latitude)) +
geom_point(aes(color = mag), alpha = 0.6) +
labs(title = "Geographic Distribution of Strong Earthquakes",
x = "Longitude", y = "Latitude") +
scale_color_viridis_c() + # This adds a color gradient based on magnitude
theme_minimal()
We can improve the above visualization by creating bins (intervals) for both longitudes and latitudes as shown below:
# Creating the geographic regions by binning longitude and latitude
strong_quakes_normalized <- strong_quakes |>
mutate(
lat_bin = cut(latitude, breaks = seq(-90, 90, by = 36), labels = c("Very South", "South", "Equatorial", "North", "Very North"), include.lowest = TRUE),
lon_bin = cut(longitude, breaks = seq(-180, 180, by = 60), labels = c("Far West", "Mid West", "West", "Central", "East", "Far East"), include.lowest = TRUE)
) |>
group_by(lat_bin, lon_bin) |>
summarize(count = n(), .groups = 'drop')
head(strong_quakes_normalized)
## # A tibble: 6 × 3
## lat_bin lon_bin count
## <fct> <fct> <int>
## 1 Very South Far West 12
## 2 Very South Mid West 1
## 3 Very South West 52
## 4 Very South Central 3
## 5 Very South Far East 15
## 6 South Far West 127
# Plotting the binned data using the ggplot() function
ggplot(strong_quakes_normalized, aes(x = lon_bin, y= lat_bin)) +
geom_tile(aes(fill = count)) +
labs(
title = "Strong Earthquakes Distribution by Regions",
x = "Longitude",
y = "Latitude",
fill = "Count"
) +
scale_fill_viridis_c() +
theme_minimal()
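To make the binning step more concrete, the sketch below applies the same latitude breaks and labels to a handful of made-up latitudes. It also uses `count()` with `.drop = FALSE`, which is an assumption about how one might keep empty regions as zero-count rows; in the summary above, combinations with no strong earthquakes (such as Very South and East) are simply absent, so their tiles are left blank.

```r
library(dplyr)
library(tibble)

lat_breaks <- seq(-90, 90, by = 36)  # -90, -54, -18, 18, 54, 90
lat_labels <- c("Very South", "South", "Equatorial", "North", "Very North")

# Made-up latitudes chosen to hit bin edges and interiors
bin_demo <- tibble(latitude = c(-90, -60, -10, 0, 45, 90)) |>
  mutate(
    lat_bin = cut(latitude, breaks = lat_breaks, labels = lat_labels,
                  include.lowest = TRUE)  # without this, -90 would become NA
  )

# .drop = FALSE keeps factor levels with no observations as zero-count rows
bin_counts <- bin_demo |> count(lat_bin, .drop = FALSE)
bin_counts
```

Here no sample latitude falls in the South bin, yet it still appears in `bin_counts` with a count of 0.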
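Is there a relationship between the magnitude of an earthquake and the number of stations that reported it?
ggplot(quakes, aes(x = mag, y = nst)) +
geom_point(aes(color = mag), alpha = 0.6) +
labs(
title = "Relationship between Magnitude and Number of Stations",
x = "Magnitude",
y = "Number of Stations"
) +
scale_color_viridis_c() +
theme_minimal()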
## Warning: Removed 14759 rows containing missing values or values outside the scale range
## (`geom_point()`).
From the above plot, we cannot tell with confidence whether there is a relationship between magnitude and the number of stations that reported the earthquake. One way to investigate this further is to create bins (intervals) for the mag variable and take the average number of stations in each bin rather than the count, since the bins contain very different numbers of earthquakes. Finally, we make a geom_tile() plot.
# we create the bins for the magnitude
quakes_mag_bin <- quakes |>
mutate(
mag_bin = cut(mag, breaks = seq(min(mag), max(mag), length.out=5), include.lowest = TRUE, labels = c("Weak", "Moderate", "Strong", "Extremely Strong"))
) |>
group_by(mag_bin) |>
summarise(count = n(), average_nst = mean(nst, na.rm = TRUE), .groups ='drop')
head(quakes_mag_bin)
## # A tibble: 4 × 3
## mag_bin count average_nst
## <fct> <int> <dbl>
## 1 Weak 16419 131.
## 2 Moderate 1598 228.
## 3 Strong 261 258.
## 4 Extremely Strong 56 304.
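The bin labels are easier to interpret once the boundaries are written out. The sketch below recomputes the four equal-width intervals from the minimum (5.0) and maximum (8.3) magnitudes reported in the earlier summary; the variable names are illustrative.

```r
# Minimum and maximum magnitudes taken from the summary() output
mag_min <- 5.0
mag_max <- 8.3

# Five equally spaced break points define four equal-width bins
mag_breaks <- seq(mag_min, mag_max, length.out = 5)
bin_width <- diff(mag_breaks)[1]

mag_breaks  # boundaries of Weak, Moderate, Strong, Extremely Strong
bin_width   # each bin spans the same range of magnitudes
```

Because the bins are equal in width rather than in size, the lowest bin holds the large majority of earthquakes, which is why averaging within each bin is more informative than counting.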
# Plotting the data
ggplot(quakes_mag_bin, aes(x = mag_bin, y = average_nst)) +
geom_tile(aes(fill = average_nst)) +
labs(
title = "Relationship between Magnitude and Number of Stations",
x = "Magnitude bin",
y = "Average number of stations"
) +
scale_fill_viridis_c() +
theme_minimal()
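From the above, the average number of seismic stations that report an earthquake rises steadily with its strength, from about 131 in the Weak bin to about 304 in the Extremely Strong bin. This is consistent with our expectation that stronger earthquakes are reported by more stations because their seismic waves travel farther. Note, however, that the scatter plot dropped 14759 of the 18334 rows because of missing values, so these averages rest on a minority of the records.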
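A rank correlation is one way to put a number on this kind of increasing trend without assuming it is linear. The sketch below computes Spearman's rho on a made-up `nst_demo` tibble that includes missing station counts; choosing Spearman over the default Pearson method is an assumption made for illustration, not something the analysis above used.

```r
library(tibble)

# Made-up magnitudes and station counts, with some counts missing
nst_demo <- tibble(
  mag = c(5.0, 5.2, 5.5, 6.1, 6.8, 7.4, 8.0),
  nst = c(40, NA, 90, 150, NA, 260, 310)
)

# use = "complete.obs" drops rows where either value is missing;
# method = "spearman" measures monotonic rather than linear association
rho <- cor(nst_demo$mag, nst_demo$nst,
           use = "complete.obs", method = "spearman")
rho
```

Applied to the real `mag` and `nst` columns, a value close to 1 would support the trend seen in the binned averages, while a value near 0 would suggest the pattern is driven by a few bins.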
We explored the Earthquakes Data Set and obtained key summaries from the data, categorized continuous variables by creating bins, and aggregated them to derive key insights. Overall, this was really interesting. However, even more insights can be derived by exploring the data further.