This data set contains historical earthquakes from multiple regions of the world. Variables include earthquake magnitude on the Richter scale, number of deaths, year, region, and geographic area. My interest in this data set comes from my awe at nature's power: I have always been impressed by strong rivers, erupting volcanoes, and the like. I'll be looking at which factors are most strongly associated with earthquake deaths. The README file doesn't say how the data was collected. Source: https://www.openintro.org/data/index.php?data=earthquakes, World Almanac and Book of Facts.
library(tidyverse)
library(ggplot2)
library(highcharter)
  year    month day richter                        area    region deaths
1 1902    April  19     7.5 Quezaltenango and San Marco Guatemala   2000
2 1902 December  16     6.4                  Uzbekistan    Russia   4700
3 1903    April  28     7.0                   Malazgirt    Turkey   3500
4 1903      May  28     5.8                        Gole    Turkey   1000
5 1905    April   4     7.5                      Kangra     India  19000
6 1906  January  31     8.8      Esmeraldas (off coast)   Ecuador   1000
Exploring and cleaning:
summary(earthquakes)
      year         month                day           richter
 Min.   :1902   Length:123         Min.   : 1.00   Min.   :5.500
 1st Qu.:1933   Class :character   1st Qu.:10.00   1st Qu.:6.800
 Median :1960   Mode  :character   Median :17.00   Median :7.200
 Mean   :1957                      Mean   :16.98   Mean   :7.128
 3rd Qu.:1981                      3rd Qu.:25.00   3rd Qu.:7.500
 Max.   :1999                      Max.   :31.00   Max.   :9.500
     area              region              deaths
 Length:123         Length:123         Min.   :     3
 Class :character   Class :character   1st Qu.:  1250
 Mode  :character   Mode  :character   Median :  2790
                                       Mean   : 17683
                                       3rd Qu.:  8000
                                       Max.   :700000
                                       NA's   :2
colSums(is.na(earthquakes))
   year   month     day richter    area  region  deaths
      0       0       0       0       4       1       2
Here, we use “filter()” with “!is.na()” to remove rows with missing values so that later plots and the model are not distorted by them. We did this based on the results from “colSums(is.na())”, which show missing values in area (4), region (1), and deaths (2).
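The cleaning code itself does not appear in this section, so the following is only a minimal sketch of what such a filter could look like. The first three rows are copied from the head() output; the last two rows are made up purely to show missing values being dropped, and filtering on area, region, and deaths together is an assumption based on the colSums() result.

```r
library(dplyr)

# Toy stand-in for the full data. Rows 1-3 come from the head() output;
# rows 4-5 are hypothetical and exist only to demonstrate NA removal.
earthquakes <- data.frame(
  richter = c(7.5, 6.4, 7.0, 6.9, 7.1),
  area    = c("Quezaltenango and San Marco", "Uzbekistan", "Malazgirt", NA, "Example area"),
  region  = c("Guatemala", "Russia", "Turkey", "Example region", "Example region"),
  deaths  = c(2000, 4700, 3500, 120, NA)
)

# Keep only rows where none of the three columns with NAs is missing.
earthquakes_clean <- earthquakes %>%
  filter(!is.na(area), !is.na(region), !is.na(deaths))

# Confirm that no missing values remain.
nrow(earthquakes_clean)
colSums(is.na(earthquakes_clean))
```

An equivalent one-liner would be tidyr's drop_na(), which removes rows with a missing value in any column.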
This code groups the earthquake data by region and summarizes key patterns within each region. The median number of deaths was calculated to reduce the influence of extreme outliers from unusually catastrophic earthquakes, while the average Richter magnitude provided a measure of the typical earthquake strength in each region. The count variable was included to show how many earthquakes were recorded in each region. Creating this summarized data-set made it easier to compare regions directly and helped reveal broader geographic patterns that would have been difficult to identify from the raw earthquake-level data alone.
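The summarizing step described above can be sketched as follows. The column names median_deaths and avg_richter match those used in the first visualization; the name count is taken from the description, and the six input rows are copied from the head() output rather than the full 123-row data, so the resulting numbers are illustrative only.

```r
library(dplyr)

# Six rows copied from the head() output; the real analysis uses the cleaned full data.
earthquakes_clean <- data.frame(
  region  = c("Guatemala", "Russia", "Turkey", "Turkey", "India", "Ecuador"),
  richter = c(7.5, 6.4, 7.0, 5.8, 7.5, 8.8),
  deaths  = c(2000, 4700, 3500, 1000, 19000, 1000)
)

region_summary <- earthquakes_clean %>%
  group_by(region) %>%
  summarise(
    median_deaths = median(deaths),  # median resists extreme death tolls
    avg_richter   = mean(richter),   # typical magnitude in the region
    count         = n(),             # number of recorded earthquakes
    .groups = "drop"
  )

region_summary
```

In this small sample only Turkey has more than one earthquake, so it is the only region where the median and mean actually summarize several values.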
Regression model:
model <- lm(deaths ~ richter, data = earthquakes_clean)
summary(model)
Call:
lm(formula = deaths ~ richter, data = earthquakes_clean)

Residuals:
   Min     1Q Median     3Q    Max
-54202 -19103 -12668  -1914 670079

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -98136      67894  -1.445   0.1511
richter        16210       9428   1.719   0.0883 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 71960 on 114 degrees of freedom
Multiple R-squared:  0.02527, Adjusted R-squared:  0.01672
F-statistic: 2.956 on 1 and 114 DF,  p-value: 0.08827
Although stronger earthquakes would be expected to cause significantly more deaths, the regression model showed a much weaker relationship than anticipated. The coefficient for Richter magnitude was positive (16,210, meaning each additional unit of magnitude is associated with roughly 16,210 more deaths on average), but it was not statistically significant at the 5% level (p = 0.088), and the model explained just 2.5% of the variation in deaths (R² = 0.025). This suggests that earthquake magnitude alone is not enough to explain fatalities, and that other factors likely play an important role.
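As a sanity check, the statistics quoted above can be cross-checked against each other using only the numbers printed in the regression output. This sketch relies on standard identities for a regression with a single predictor; the variable names are chosen here for illustration and are not part of the original analysis.

```r
# Values copied from the printed summary(model) output.
f_stat    <- 2.956    # F-statistic
t_richter <- 1.719    # t value for richter
df_resid  <- 114      # residual degrees of freedom
r2_shown  <- 0.02527  # Multiple R-squared

# With one predictor, R-squared = F / (F + residual df).
r2_from_f <- f_stat / (f_stat + df_resid)

# Sample size implied by the residual df (intercept + one slope).
n <- df_resid + 2

# Adjusted R-squared for a single predictor.
adj_r2 <- 1 - (1 - r2_shown) * (n - 1) / df_resid

# With one predictor, the squared t value equals the F-statistic up to rounding.
t_squared <- t_richter^2

round(c(r2_from_f = r2_from_f, adj_r2 = adj_r2, t_squared = t_squared), 4)
```

The recomputed values agree with the printed R², adjusted R², and F-statistic to rounding. The implied sample size of 116 is consistent with the 123 original rows minus the 7 missing entries reported by colSums(), assuming each missing value falls in a different row.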
Visualization 1:
ggplot(region_summary, aes(x = avg_richter, y = median_deaths, color = avg_richter)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "skyblue", high = "darkred") +
  scale_y_log10() +
  labs(
    title = "Average Earthquake Magnitude vs Median Deaths by Region",
    subtitle = "Regions with stronger earthquakes often experienced greater fatalities",
    x = "Average Richter Magnitude",
    y = "Median Deaths (Log Scale)",
    color = "Average Magnitude",
    caption = "Source: Earthquake dataset"
  ) +
  theme_minimal()
This visualization explores the relationship between average earthquake magnitude and median deaths across regions. Using median deaths and a logarithmic scale reduced the influence of extreme outliers and made regional differences easier to compare. The plot shows a general positive relationship, where regions with stronger average earthquakes tended to experience higher median death counts. However, the spread of the points also suggests that other factors, such as infrastructure, population density, and disaster preparedness, may influence earthquake fatalities.
This code groups the earthquake data by region and filters out regions with very few earthquake observations. Keeping only regions with at least five recorded earthquakes reduced clutter in the visualization and allowed more meaningful regional comparisons. This helped focus the analysis on regions with enough data to better observe patterns between earthquake magnitude and deaths.
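The filtering code is not shown in this section, so the following is a hedged sketch of one way to keep only regions with enough observations. The helper keep_active_regions() and its min_quakes argument are hypothetical names introduced here; the six rows come from the head() output, which is why the demonstration call lowers the threshold from five to two.

```r
library(dplyr)

# Six rows copied from the head() output; the real analysis uses the cleaned full data.
earthquakes_clean <- data.frame(
  region  = c("Guatemala", "Russia", "Turkey", "Turkey", "India", "Ecuador"),
  richter = c(7.5, 6.4, 7.0, 5.8, 7.5, 8.8),
  deaths  = c(2000, 4700, 3500, 1000, 19000, 1000)
)

# Hypothetical helper: keep earthquakes from regions with at least min_quakes rows.
keep_active_regions <- function(data, min_quakes = 5) {
  data %>%
    group_by(region) %>%
    filter(n() >= min_quakes) %>%  # n() is the group size within each region
    ungroup()
}

# Threshold lowered to 2 only because the sample has six rows.
earthquakes_top <- keep_active_regions(earthquakes_clean, min_quakes = 2)
earthquakes_top
```

Unlike summarise(), a grouped filter() keeps the individual earthquake rows, which is what a per-earthquake scatterplot needs.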
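# Interactive scatterplot: one point per earthquake, colored by region
hchart(earthquakes_top, "scatter", hcaes(x = richter, y = deaths, group = region)) %>%
  hc_title(text = "Earthquake Magnitude and Deaths by Region") %>%
  hc_xAxis(title = list(text = "Richter Magnitude")) %>%
  hc_yAxis(title = list(text = "Deaths")) %>%
  # brewer.pal() comes from RColorBrewer, which is not loaded above
  hc_colors(RColorBrewer::brewer.pal(8, "Set2")) %>%
  hc_add_theme(hc_theme_smpl())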
This interactive scatterplot explores the relationship between earthquake magnitude and deaths across regions with higher earthquake activity. Each point represents an earthquake, while the colors distinguish different regions. The visualization supports the regression findings by showing that stronger earthquakes do not always lead to higher fatalities, suggesting that additional geographic and social factors may influence earthquake impacts.
This Tableau visualization compares the frequency of earthquakes across different regions. By displaying the total number of recorded earthquakes in each region, the chart highlights which areas experienced earthquake activity most often. The visualization provides additional context for the project by showing that earthquake impacts are influenced not only by magnitude but also by how frequently earthquakes occur within certain regions.
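The chart itself was built in Tableau, but the per-region totals behind it can be reproduced in R. This is a rough sketch using the six rows from the head() output, and the column name n_quakes is an illustrative choice, not one from the original project.

```r
library(dplyr)

# Six rows copied from the head() output; the real analysis uses the cleaned full data.
earthquakes_clean <- data.frame(
  region  = c("Guatemala", "Russia", "Turkey", "Turkey", "India", "Ecuador"),
  richter = c(7.5, 6.4, 7.0, 5.8, 7.5, 8.8)
)

# Number of recorded earthquakes per region, most frequent first.
region_counts <- earthquakes_clean %>%
  count(region, sort = TRUE, name = "n_quakes")

region_counts
```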
The regression analysis showed a much weaker relationship between earthquake magnitude and deaths than expected, with an R² value of only 0.025. This suggests that earthquake magnitude alone cannot explain fatalities. Outside research supports this finding, emphasizing that infrastructure quality, building safety, and disaster preparedness often play a major role in determining earthquake deaths.