Week 2 Data Dive

Earthquakes Data Set

Numeric Summary of The Data

First, we load the tidyverse library, which will help us transform and present our data.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Second, we load the Earthquakes Data Set into memory by reading the CSV file.

quakes <- read_delim("./quakes.csv")

## Rows: 18334 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): magType, net, id, place, type, status, locationSource, magSource
## dbl  (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
## dttm  (2): time, updated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can now get a summary of the data that we have using the code below:

summary(quakes[c("mag", "depth")])

##       mag            depth       
##  Min.   :5.000   Min.   : -1.01  
##  1st Qu.:5.100   1st Qu.: 10.00  
##  Median :5.200   Median : 13.63  
##  Mean   :5.341   Mean   : 53.02  
##  3rd Qu.:5.500   3rd Qu.: 45.53  
##  Max.   :8.300   Max.   :670.81

From the summary above we can note the following insights.
- The maximum depth recorded for all earthquakes between 2013 and 2023 is 670.81 km and the minimum depth is -1.01 km. This raises the question of whether the depth of an earthquake can meaningfully be negative.
- The mean depth is 53.02 km, while the median is only 13.63 km. The mean is pulled upward by a relatively small number of very deep earthquakes, so the depth distribution is strongly right-skewed.
- The summary does not tell us which magnitude scales were used, which is relevant to us. We can list the unique values of magType as follows:
```
unique_mag_types <- quakes |>
  select(magType) |>
  distinct()
unique_mag_types
```
```
## # A tibble: 13 × 1
##    magType
##    <chr>  
##  1 mb     
##  2 mwb    
##  3 mw     
##  4 mww    
##  5 mwr    
##  6 mwc    
##  7 ms     
##  8 ml     
##  9 Md     
## 10 Ml     
## 11 ms_20  
## 12 mwp    
## 13 Mi
```
We now have the types of magnitude scales used. However, we notice one problem: the same scale can appear as two different types simply because of differences in letter case (for example, ml and Ml), which is incorrect. We can modify the above code as shown below to handle case sensitivity.

unique_mag_types <- quakes |>
  select(magType) |>
  mutate(magType = tolower(magType)) |>
  distinct()
unique_mag_types

## # A tibble: 12 × 1
##    magType
##    <chr>  
##  1 mb     
##  2 mwb    
##  3 mw     
##  4 mww    
##  5 mwr    
##  6 mwc    
##  7 ms     
##  8 ml     
##  9 md     
## 10 ms_20  
## 11 mwp    
## 12 mi
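Beyond listing the distinct scales, it can be useful to see how many records fall under each one after the case is normalized. The following is a minimal sketch of that idea; `sample_quakes` is a small hypothetical stand-in for the real `quakes` data, and the values in it are made up for illustration.

```r
library(dplyr)

# Hypothetical sample standing in for the magType column of quakes
sample_quakes <- tibble(magType = c("mb", "ml", "Ml", "Md", "mww", "mww"))

# Lower-case the scale names first, then count records per scale
mag_type_counts <- sample_quakes |>
  mutate(magType = tolower(magType)) |>
  count(magType, sort = TRUE)

mag_type_counts
```

Because `tolower()` is applied before counting, "ml" and "Ml" end up in the same group instead of being reported separately.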

Novel Questions to Investigate

Novel Question 1

Is there a correlation between the depth at which an earthquake originated and its magnitude? Here we want to investigate whether earthquakes originating deeper below the earth’s surface have more energy, or vice versa.
- For this question, we can use the cor() function to evaluate the correlation between depth and magnitude within each magnitude scale, as shown in the code snippet below:

# Correlation between magnitude and depth for each magnitude scale.
# Inside summarise(), columns are named with `=`, not `<-`.
quakes |>
  drop_na(mag, depth, magType) |>
  group_by(magType) |>
  summarise(
    Correlation = cor(mag, depth),
    Count = n()
  )

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `Correlation = cor(mag, depth)`.
## ℹ In group 6: `magType = "ms"`.
## Caused by warning in `cor()`:
## ! the standard deviation is zero

## # A tibble: 13 × 3
##    magType Correlation Count
##    <chr>         <dbl> <int>
##  1 Md         NA           1
##  2 Mi          1           2
##  3 Ml         NA           1
##  4 mb          0.00838  8346
##  5 ml          0.143      55
##  6 ms         NA           3
##  7 ms_20      -0.533       3
##  8 mw          0.0748    158
##  9 mwb         0.183     593
## 10 mwc        -0.0647    228
## 11 mwp         0.139       6
## 12 mwr        -0.157     231
## 13 mww         0.127    8707

From the results above, we see that the groups with relatively high correlations are the ones with very few records. A correlation computed from a handful of observations is unreliable and could distort our analysis. For that reason, we keep only the groups with more than 100 records. The code is adjusted as below:

# Same computation, restricted to scales with more than 100 records
quakes |>
  drop_na(mag, depth, magType) |>
  group_by(magType) |>
  filter(n() > 100) |>
  summarise(
    Correlation = cor(mag, depth),
    Count = n()
  )

## # A tibble: 6 × 3
##   magType Correlation Count
##   <chr>         <dbl> <int>
## 1 mb          0.00838  8346
## 2 mw          0.0748    158
## 3 mwb         0.183     593
## 4 mwc        -0.0647    228
## 5 mwr        -0.157     231
## 6 mww         0.127    8707

From the above results, we see that the largest correlation in absolute value is 0.183 (for mwb), or approximately \(0.2\), which indicates a weak correlation between the depth and magnitude of the recorded earthquakes. We therefore conclude that there is little or no linear association between the two variables.
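The extreme and missing correlations seen for the very small groups are a property of cor() itself rather than of the earthquakes. The sketch below reproduces the three cases with made-up numbers (the magnitude and depth values are hypothetical and not taken from the data set):

```r
# Two observations: a straight line passes through both points exactly,
# so the correlation is always +1 or -1, whatever the true relationship is.
two_point_cor <- cor(c(5.1, 5.4), c(10, 35))

# Constant magnitude: the standard deviation is zero, so cor() returns NA
# (it also raises the "the standard deviation is zero" warning, silenced here).
constant_cor <- suppressWarnings(cor(c(5.0, 5.0, 5.0), c(10, 20, 30)))

# A single observation: there is nothing to correlate, so the result is NA.
single_cor <- suppressWarnings(cor(5.2, 12))

c(two_point = two_point_cor, constant = constant_cor, single = single_cor)
```

This is why a correlation of 1 from two records, or NA from one or three records, carries no information, and why filtering on group size before interpreting the values is a sensible precaution.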

Novel Question 2

Are there specific geographic regions that are more prone to strong earthquakes, or is the distribution of strong earthquakes just random?
- Although different scales have been used to measure the magnitude, we’ll treat the \(mag\) variable as being on a single scale, since the difference does not matter for our purposes, as noted on the USGS website. To get the earthquakes that we consider strong, we filter for \(mag \ge 6\). Finally, we make a plot of the distribution.

# Filter for strong earthquakes
strong_quakes <- quakes |>
  filter(mag >= 6.0)

# Plotting the geographic distribution of strong earthquakes
ggplot(strong_quakes, aes(x = longitude, y = latitude)) +
  geom_point(aes(color = mag), alpha = 0.6) +
  labs(title = "Geographic Distribution of Strong Earthquakes",
       x = "Longitude", y = "Latitude") +
  scale_color_viridis_c() +  # This adds a color gradient based on magnitude
  theme_minimal()

We can improve the above visualization by creating bins (intervals) for both longitudes and latitudes as shown below:

  # Creating the geographic regions by creating longitude and latitude bins
  strong_quakes_normalized <- strong_quakes |>
    mutate(
      lat_bin = cut(latitude, breaks = seq(-90, 90, by = 36), labels = c("Very South", "South", "Equatorial", "North", "Very North"), include.lowest = TRUE),
      lon_bin = cut(longitude, breaks = seq(-180, 180, by = 60), labels = c("Far West", "Mid West", "West", "Central", "East", "Far East"), include.lowest = TRUE)
    ) |>
    group_by(lat_bin, lon_bin) |>
    summarize(count = n(), .groups = 'drop')

  head(strong_quakes_normalized)

## # A tibble: 6 × 3
##   lat_bin    lon_bin  count
##   <fct>      <fct>    <int>
## 1 Very South Far West    12
## 2 Very South Mid West     1
## 3 Very South West        52
## 4 Very South Central      3
## 5 Very South Far East    15
## 6 South      Far West   127

  # Plotting the binned data using the ggplot() function
  ggplot(strong_quakes_normalized, aes(x = lon_bin, y = lat_bin)) +
    geom_tile(aes(fill = count)) +
    labs(
      title = "Strong Earthquakes Distribution by Regions",
      x = "Longitude",
      y = "Latitude",
      fill = "Count"  # the legend shows the number of earthquakes per cell
    ) +
    scale_fill_viridis_c() +
    theme_minimal()

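From the plot above, we see that strong earthquakes are most heavily concentrated in the Equatorial, Far East cell (latitudes between -18 and 18, longitudes between 120 and 180). This suggests that strong earthquakes are not spread randomly across the globe.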
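To make the binning step more concrete, the sketch below applies cut() with the same latitude breaks and labels to a few hand-picked latitudes (the four input values are hypothetical). It also shows why include.lowest = TRUE is needed: intervals are closed on the right by default, so a value equal to the smallest break would otherwise fall outside every bin.

```r
lat_breaks <- seq(-90, 90, by = 36)  # -90, -54, -18, 18, 54, 90
lat_labels <- c("Very South", "South", "Equatorial", "North", "Very North")

# Hypothetical latitudes, including both extremes and one interior break
latitudes <- c(-90, -54, 0, 90)

# With include.lowest = TRUE the first interval becomes [-90, -54]
with_lowest <- cut(latitudes, breaks = lat_breaks, labels = lat_labels,
                   include.lowest = TRUE)

# Without it the first interval is (-90, -54], so -90 is left unassigned
without_lowest <- cut(latitudes, breaks = lat_breaks, labels = lat_labels)

as.character(with_lowest)
# "Very South" "Very South" "Equatorial" "Very North"
is.na(without_lowest)
# TRUE FALSE FALSE FALSE
```

Note that -54 lands in "Very South" rather than "South", because each interval includes its upper bound.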

Novel Question 3

Is there a correlation between the magnitude of an earthquake and the number of stations that reported it? Here we would expect a relatively higher number of stations to report earthquakes of higher magnitude.

ggplot(quakes, aes(x = mag, y = nst)) +
  geom_point(aes(color = mag), alpha = 0.6) +  # transparency belongs to the point layer
  labs(
    title = "Relationship between Magnitude and # of Stations",
    x = "Magnitude",
    y = "# of Stations"
  ) +
  scale_color_viridis_c() +  # matches the color aesthetic used above
  theme_minimal()

## Warning: Removed 14759 rows containing missing values or values outside the scale range
## (`geom_point()`).

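From the above plot, we cannot say with confidence that there is a relationship between magnitude and the number of stations that reported the earthquake. Note also that the warning shows 14,759 of the 18,334 rows were dropped from the plot, most likely because nst is missing for them, so the plot is based on a minority of the records. One way to investigate further is to create bins (intervals) for the mag variable and take the average number of stations in each bin. Using the average rather than the count is important here, because the bins contain very different numbers of earthquakes. Finally, we make a geom_tile() plot.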

# we create the bins for the magnitude

quakes_mag_bin <- quakes |>
  mutate(
    mag_bin =  cut(mag, breaks = seq(min(mag), max(mag), length.out=5), include.lowest = TRUE, labels = c("Weak", "Moderate", "Strong", "Extremely Strong"))
  ) |>
  group_by(mag_bin) |>
  summarise(count = n(), average_nst = mean(nst, na.rm = TRUE), .groups ='drop')
head(quakes_mag_bin)

## # A tibble: 4 × 3
##   mag_bin          count average_nst
##   <fct>            <int>       <dbl>
## 1 Weak             16419        131.
## 2 Moderate          1598        228.
## 3 Strong             261        258.
## 4 Extremely Strong    56        304.

# Plotting the data
ggplot(quakes_mag_bin, aes(x = mag_bin, y = average_nst)) +
  geom_tile(aes(fill = average_nst)) +
  labs(
    title = "Relationship between Magnitude and # of Stations",
    x = "Magnitude bin",
    y = "Average # of stations"
  ) +
  scale_fill_viridis_c() +
  theme_minimal()

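From the above, the average number of seismic stations reporting an earthquake rises steadily with its strength, from about 131 for the Weak bin to about 304 for the Extremely Strong bin. This is consistent with our expectation that stronger earthquakes are reported by more stations because they can be detected farther away. The increase is monotonic but not strictly proportional, and the averages are based only on records where nst is available.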
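A single-number check that complements the binned averages is a rank (Spearman) correlation computed only on the records where the station count is present. The sketch below illustrates the idea on made-up values; the `mag` and `nst` vectors are hypothetical and are not taken from the data set, so the result says nothing about the real earthquakes.

```r
# Hypothetical magnitudes and station counts, with some counts missing
mag <- c(5.0, 5.2, 5.5, 6.1, 6.8, 7.4)
nst <- c(40, NA, 90, 150, NA, 300)

# How many records would be excluded because nst is missing
n_missing <- sum(is.na(nst))

# Spearman correlation on complete pairs only; it measures whether nst
# tends to increase with mag, without assuming a straight-line relationship
rho <- cor(mag, nst, use = "complete.obs", method = "spearman")

n_missing
rho
```

Applied to the real data, the same call would take `quakes$mag` and `quakes$nst` as inputs. Reporting the number of excluded records alongside the coefficient makes it clear how much of the data the estimate rests on.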

Conclusion

We explored the Earthquakes Data Set and were able to obtain key summaries, categorize continuous variables by creating bins, and aggregate the data within those bins to derive key insights. Overall, this was really interesting, and even more insights can be derived by exploring the data further.