Final Project

Author

Surafel Haile

Earthquake Analysis:

Lisbon earthquake 1755. Source: Wikimedia Commons.

Introduction:

This dataset contains historical earthquakes from multiple regions of the world. Variables include earthquake magnitude on the Richter scale, number of deaths, year, region, and geographic area. My interest in this dataset comes from my general shock at seeing nature’s power. I’ve always been impressed by strong rivers flowing, a volcano erupting, and similar events. I’ll be looking at which factors are most associated with earthquake deaths. The README file doesn’t state how the data was collected. Source: https://www.openintro.org/data/index.php?data=earthquakes, World Almanac and Book of Facts.

library(tidyverse)

library(ggplot2)
library(highcharter)

library(RColorBrewer)

earthquakes <- read.csv("earthquakes.csv")
head(earthquakes)

  year    month day richter                        area    region deaths
1 1902    April  19     7.5 Quezaltenango and San Marco Guatemala   2000
2 1902 December  16     6.4                  Uzbekistan    Russia   4700
3 1903    April  28     7.0                   Malazgirt    Turkey   3500
4 1903      May  28     5.8                        Gole    Turkey   1000
5 1905    April   4     7.5                      Kangra     India  19000
6 1906  January  31     8.8      Esmeraldas (off coast)   Ecuador   1000

Exploring and cleaning:

summary(earthquakes)

      year         month                day           richter     
 Min.   :1902   Length:123         Min.   : 1.00   Min.   :5.500  
 1st Qu.:1933   Class :character   1st Qu.:10.00   1st Qu.:6.800  
 Median :1960   Mode  :character   Median :17.00   Median :7.200  
 Mean   :1957                      Mean   :16.98   Mean   :7.128  
 3rd Qu.:1981                      3rd Qu.:25.00   3rd Qu.:7.500  
 Max.   :1999                      Max.   :31.00   Max.   :9.500  
                                                                  
     area              region              deaths      
 Length:123         Length:123         Min.   :     3  
 Class :character   Class :character   1st Qu.:  1250  
 Mode  :character   Mode  :character   Median :  2790  
                                       Mean   : 17683  
                                       3rd Qu.:  8000  
                                       Max.   :700000  
                                       NA's   :2

colSums(is.na(earthquakes))

   year   month     day richter    area  region  deaths 
      0       0       0       0       4       1       2

earthquakes_clean <- earthquakes %>% 
  filter(!is.na(deaths),
         !is.na(region),
         !is.na(area))

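Here, we use “filter()” with “!is.na()” to remove rows with missing values in deaths, region, or area, so that later summaries and plots are not affected by incomplete records. We chose these columns based on the results of “colSums(is.na())”, which showed 4 missing values in area, 1 in region, and 2 in deaths.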
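As a cross-check on the cleaning step, the sketch below counts missing values per column and the rows dropped. It uses a small hypothetical data frame (`toy`) rather than the real file, and base R's `complete.cases()` in place of the dplyr filter; both choices are assumptions made so the example runs on its own. In the report's data, the counts 4 + 1 + 2 = 7 match the drop from 123 rows to the 116 rows implied by the regression's 114 residual degrees of freedom.

```r
# Hypothetical toy data; NA marks a missing value, as in the real dataset
toy <- data.frame(
  region = c("Turkey", NA, "India", "Ecuador", "Turkey"),
  area   = c("Malazgirt", "Uzbekistan", NA, "Esmeraldas", "Gole"),
  deaths = c(3500, 4700, 19000, NA, 1000)
)

# Missing values in each column
na_per_column <- colSums(is.na(toy))

# complete.cases() is TRUE only for rows with no NA in the chosen columns,
# i.e. the rows that filter(!is.na(deaths), !is.na(region), !is.na(area)) keeps
toy_clean <- toy[complete.cases(toy[, c("deaths", "region", "area")]), ]

rows_dropped <- nrow(toy) - nrow(toy_clean)
print(na_per_column)
print(rows_dropped)
```

When each missing value sits in a different row, as here, the number of rows dropped equals the sum of the per-column counts.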

region_summary <- earthquakes_clean %>%
  group_by(region) %>%
  summarize(
    median_deaths = median(deaths),
    avg_richter = mean(richter),
    count = n()
  )
head(region_summary)

# A tibble: 6 × 4
  region                 median_deaths avg_richter count
  <chr>                          <dbl>       <dbl> <int>
1 Afghanistan                    2162.        6.8      2
2 Afghanistan-Tajikistan         4000         6.6      1
3 Algeria                        3125         7.25     2
4 Argentina                      8000         7.4      1
5 Armenia                       25000         6.8      1
6 Armenia-Azerbaijan             2800         5.7      1

This code groups the earthquake data by region and summarizes key patterns within each region. The median number of deaths is used to reduce the influence of extreme outliers from unusually catastrophic earthquakes, while the average Richter magnitude gives a measure of the typical earthquake strength in each region. The count variable shows how many earthquakes were recorded in each region; several regions have only one or two, so their summaries rest on very little data. Creating this summarized dataset makes it easier to compare regions directly and helps reveal broader geographic patterns that would be difficult to identify from the raw earthquake-level data alone.
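To illustrate why the median is less sensitive to outliers than the mean, here is a minimal sketch. The vector is illustrative only: four death counts taken from the head() output above plus the dataset maximum of 700,000. It is not a real regional grouping.

```r
# Illustrative values: four counts from head(earthquakes) plus the dataset maximum
deaths <- c(1000, 2000, 3500, 4700, 700000)

mean_deaths   <- mean(deaths)    # pulled far upward by the single extreme value
median_deaths <- median(deaths)  # the middle value, unaffected by how large the maximum is

print(mean_deaths)    # 142240
print(median_deaths)  # 3500
```

One catastrophic event moves the mean to about 142,000 while the median stays at 3,500, which is closer to what a typical earthquake in this sample looks like.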

Regression model:

model <- lm(deaths ~ richter,
            data = earthquakes_clean)
summary(model)


Call:
lm(formula = deaths ~ richter, data = earthquakes_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-54202 -19103 -12668  -1914 670079 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -98136      67894  -1.445   0.1511  
richter        16210       9428   1.719   0.0883 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 71960 on 114 degrees of freedom
Multiple R-squared:  0.02527,   Adjusted R-squared:  0.01672 
F-statistic: 2.956 on 1 and 114 DF,  p-value: 0.08827

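Although stronger earthquakes would be expected to cause significantly more deaths, the regression model showed a much weaker relationship than anticipated. The coefficient for Richter magnitude was positive: each one-point increase in magnitude was associated with an estimated 16,210 additional deaths. However, the relationship was not statistically significant at the 0.05 level (p = 0.088), and the model explained just 2.5% of the variation in deaths (R² = 0.025). The residuals were also heavily right-skewed, with a median of -12,668 and a maximum of 670,079, which suggests that at least one catastrophic earthquake strongly influences the fit. Overall, earthquake magnitude alone is not enough to explain fatalities, and other factors likely play an important role.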
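The slope, R², and p-value discussed above can also be pulled out of a fitted model programmatically rather than read off the printed summary. The sketch below assumes a stand-in data frame (`quakes_head`) built from the six rows shown in the head() output, so its numbers will differ from the full-data results reported above.

```r
# Six rows copied from head(earthquakes); a stand-in for earthquakes_clean
quakes_head <- data.frame(
  richter = c(7.5, 6.4, 7.0, 5.8, 7.5, 8.8),
  deaths  = c(2000, 4700, 3500, 1000, 19000, 1000)
)

fit <- lm(deaths ~ richter, data = quakes_head)
fit_summary <- summary(fit)

# Estimated change in deaths per one-point increase in magnitude
slope <- coef(fit)[["richter"]]

# Share of the variation in deaths explained by the model
r_squared <- fit_summary$r.squared

# p-value for the richter coefficient, from the coefficient table
p_value <- coef(fit_summary)["richter", "Pr(>|t|)"]

cat("slope:", slope, "\nR-squared:", r_squared, "\np-value:", p_value, "\n")
```

Replacing `quakes_head` with `earthquakes_clean` should reproduce the values printed by summary(model).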

Visualization 1:

ggplot(region_summary,
       aes(x = avg_richter,
           y = median_deaths,
           color = avg_richter)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "skyblue", high = "darkred") +
  scale_y_log10() +
  labs(
    title = "Average Earthquake Magnitude vs Median Deaths by Region",
    subtitle = "Regions with stronger earthquakes often experienced greater fatalities",
    x = "Average Richter Magnitude",
    y = "Median Deaths (Log Scale)",
    color = "Average Magnitude",
    caption = "Source: Earthquake dataset"
  ) +
  theme_minimal()

This visualization explores the relationship between average earthquake magnitude and median deaths across regions. Using median deaths and a logarithmic scale reduced the influence of extreme outliers and made regional differences easier to compare. The plot shows a general positive relationship, where regions with stronger average earthquakes tended to experience higher median death counts. However, the spread of the points also suggests that other factors, such as infrastructure, population density, and disaster preparedness, may influence earthquake fatalities.
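A short worked example shows how much a base-10 logarithm compresses the range of death counts. The three values are the minimum, median, and maximum of deaths from the summary() output above; the variable names are only for illustration.

```r
# Minimum, median, and maximum deaths, taken from summary(earthquakes)
deaths_summary <- c(min = 3, median = 2790, max = 700000)

# On the raw scale the maximum is hundreds of thousands of times the minimum
raw_ratio <- deaths_summary[["max"]] / deaths_summary[["min"]]

# On the log10 scale the same values span only a few units
log_deaths <- log10(deaths_summary)

print(raw_ratio)
print(round(log_deaths, 2))  # 0.48, 3.45, 5.85
```

This is why points that would otherwise be crushed against the bottom of the y-axis become distinguishable once the axis is logarithmic.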

Visualization 2:

earthquakes_top <- earthquakes_clean %>%
  group_by(region) %>%
  filter(n() >= 5)

This code groups the earthquake data by region and filters out regions with very few earthquake observations. Keeping only regions with at least five recorded earthquakes reduced clutter in the visualization and allowed more meaningful regional comparisons. This helped focus the analysis on regions with enough data to better observe patterns between earthquake magnitude and deaths.
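The group-size filter can be sketched in base R to make explicit what `n()` does inside a grouped `filter()`. The example assumes a stand-in data frame built from the six head() rows and lowers the threshold to 2 (a hypothetical `min_quakes` value), since no region appears five times in so small a sample.

```r
# Six rows copied from head(earthquakes); a stand-in for earthquakes_clean
quakes_head <- data.frame(
  region  = c("Guatemala", "Russia", "Turkey", "Turkey", "India", "Ecuador"),
  richter = c(7.5, 6.4, 7.0, 5.8, 7.5, 8.8),
  deaths  = c(2000, 4700, 3500, 1000, 19000, 1000)
)

# The report uses 5; lowered here only because the sample has six rows
min_quakes <- 2

# ave() returns, for every row, the number of rows sharing that row's region,
# which is the role n() plays after group_by(region)
group_size <- ave(seq_along(quakes_head$region), quakes_head$region, FUN = length)

# Keep only rows whose region meets the threshold
quakes_top <- quakes_head[group_size >= min_quakes, ]
print(quakes_top)
```

Only the two Turkey rows survive, because Turkey is the only region with at least two earthquakes in this sample.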

hchart(
  earthquakes_top,
  "scatter",
  hcaes(
    x = richter,
    y = deaths,
    group = region
  )
) %>%
  hc_title(text = "Earthquake Magnitude and Deaths by Region") %>%
  hc_xAxis(title = list(text = "Richter Magnitude")) %>%
  hc_yAxis(title = list(text = "Deaths")) %>%
  hc_colors(brewer.pal(8, "Set2")) %>% 
  hc_add_theme(hc_theme_smpl())

This interactive scatterplot explores the relationship between earthquake magnitude and deaths across regions with at least five recorded earthquakes. Each point represents an earthquake, while the colors distinguish different regions. The visualization supports the regression findings by showing that stronger earthquakes do not always lead to higher fatalities, suggesting that additional geographic and social factors may influence earthquake impacts.

Tableau Visualization:

Tableau Visualization

This Tableau visualization compares the frequency of earthquakes across different regions. By displaying the total number of recorded earthquakes in each region, the chart highlights which areas experienced earthquake activity most often. This provides additional context for the project: a region's total death toll may reflect not only the magnitude of its earthquakes, but also how frequently earthquakes occur there.
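The per-region counts behind a frequency chart like this can be cross-checked in R. The sketch below is not the Tableau workflow itself; it assumes a stand-in vector of the six regions from the head() output and uses base R's `table()`.

```r
# Regions from the six rows shown by head(earthquakes)
regions <- c("Guatemala", "Russia", "Turkey", "Turkey", "India", "Ecuador")

# Count earthquakes per region and order from most to least frequent
region_counts <- sort(table(regions), decreasing = TRUE)
print(region_counts)
```

Applied to `earthquakes_clean$region`, the same two calls should give the totals the chart displays, provided the chart was built from the cleaned data.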

Outside Research:

Devraj, Ranjit. “Quakes Do Not Kill People, Bad Buildings Do.” PreventionWeb, 23 Apr. 2024, https://www.preventionweb.net/news/quakes-do-not-kill-people-bad-buildings-do. Accessed 17 May 2026.

The regression analysis showed a much weaker relationship between earthquake magnitude and deaths than expected, with an R² value of only 0.025. This suggests that earthquake magnitude alone cannot explain fatalities. Outside research supports this finding, emphasizing that infrastructure quality, building safety, and disaster preparedness often play a major role in determining earthquake deaths.