Global Disaster Data Analysis

1 Introduction

This project analyzes global disaster records from the EM-DAT International Disaster Database to examine patterns in disaster occurrence, regional distribution, and factors associated with disaster mortality. The analysis focuses on identifying the most frequent disaster types, regions with higher exposure, and the relationship between disaster magnitude, affected population, and mortality outcomes.

By combining exploratory analysis with statistical modeling, the study highlights key drivers of disaster impact and provides a structured view of global disaster risk patterns.

2 Objectives

The objectives of this analysis are:

Identify the most frequent disaster types
Examine regional disaster distribution
Compare disaster severity using mortality impact
Evaluate the relationship between magnitude and deaths
Analyze population exposure patterns
Develop a simple regression model to study mortality drivers

3 Load Required Libraries

library(dplyr)
library(ggplot2)
library(readr)
library(stringr)
library(lubridate)
library(tidyr)
library(maps)
library(viridis)

theme_set(theme_minimal(base_size = 12))

4 Load Dataset

df <- read_csv(
  "C:/Users/aj520/Downloads/public_emdat_2026-03-17.csv",
  locale = locale(encoding = "ISO-8859-1")
)

## Rows: 16765 Columns: 47
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (23): DisNo., Historic, Classification Key, Disaster Group, Disaster Su...
## dbl  (22): AID Contribution ('000 US$), Magnitude, Latitude, Longitude, Star...
## date  (2): Entry Date, Last Update
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dim(df)

## [1] 16765    47

glimpse(df)

## Rows: 16,765
## Columns: 47
## $ DisNo.                                      <chr> "2018-0040-BRA", "2002-035…
## $ Historic                                    <chr> "No", "No", "No", "No", "N…
## $ `Classification Key`                        <chr> "nat-hyd-flo-flo", "nat-cl…
## $ `Disaster Group`                            <chr> "Natural", "Natural", "Nat…
## $ `Disaster Subgroup`                         <chr> "Hydrological", "Climatolo…
## $ `Disaster Type`                             <chr> "Flood", "Wildfire", "Floo…
## $ `Disaster Subtype`                          <chr> "Flood (General)", "Forest…
## $ `External IDs`                              <chr> "DFO:4576", NA, NA, NA, NA…
## $ `Event Name`                                <chr> NA, NA, NA, NA, NA, NA, "M…
## $ ISO                                         <chr> "BRA", "USA", "RWA", "USA"…
## $ Country                                     <chr> "Brazil", "United States o…
## $ Subregion                                   <chr> "Latin America and the Car…
## $ Region                                      <chr> "Americas", "Americas", "A…
## $ Location                                    <chr> "Rio de Janeiro", "Colorad…
## $ Origin                                      <chr> "Heavy rains", NA, NA, NA,…
## $ `Associated Types`                          <chr> "Collapse|Flood", NA, NA, …
## $ `OFDA/BHA Response`                         <chr> "No", "No", "No", "No", "N…
## $ Appeal                                      <chr> "No", "No", "No", "No", "N…
## $ Declaration                                 <chr> "No", "No", "No", "No", "N…
## $ `AID Contribution ('000 US$)`               <dbl> NA, NA, NA, NA, NA, NA, NA…
## $ Magnitude                                   <dbl> 55138.95, 770.00, NA, NA, …
## $ `Magnitude Scale`                           <chr> "Km2", "Km2", "Km2", "Km2"…
## $ Latitude                                    <dbl> -22.479, NA, NA, NA, NA, N…
## $ Longitude                                   <dbl> -44.095, NA, NA, NA, NA, N…
## $ `River Basin`                               <chr> NA, NA, NA, NA, NA, NA, NA…
## $ `Start Year`                                <dbl> 2018, 2002, 2022, 2024, 20…
## $ `Start Month`                               <dbl> 2, 6, 11, 1, 8, 9, 9, 4, 9…
## $ `Start Day`                                 <dbl> 14, 8, 17, NA, 31, 20, 8, …
## $ `End Year`                                  <dbl> 2018, 2002, 2022, 2024, 20…
## $ `End Month`                                 <dbl> 2, 6, 11, 12, 8, 9, 9, 4, …
## $ `End Day`                                   <dbl> 16, 8, 18, NA, 31, 21, 8, …
## $ `Total Deaths`                              <dbl> 4, NA, 3, NA, 10, NA, 20, …
## $ `No. Injured`                               <dbl> NA, NA, NA, NA, 20, NA, NA…
## $ `No. Affected`                              <dbl> 250, 1500, NA, NA, NA, 500…
## $ `No. Homeless`                              <dbl> NA, 72, NA, NA, NA, NA, NA…
## $ `Total Affected`                            <dbl> 250, 1572, NA, NA, 20, 500…
## $ `Reconstruction Costs ('000 US$)`           <dbl> NA, NA, NA, NA, NA, NA, NA…
## $ `Reconstruction Costs, Adjusted ('000 US$)` <dbl> NA, NA, NA, NA, NA, NA, NA…
## $ `Insured Damage ('000 US$)`                 <dbl> NA, NA, NA, NA, NA, NA, NA…
## $ `Insured Damage, Adjusted ('000 US$)`       <dbl> NA, NA, NA, NA, NA, NA, NA…
## $ `Total Damage ('000 US$)`                   <dbl> 10000, 20000, NA, 5400000,…
## $ `Total Damage, Adjusted ('000 US$)`         <dbl> 12492, 34879, NA, 5400000,…
## $ CPI                                         <dbl> 80.04960, 57.34184, 93.294…
## $ `Admin Units`                               <chr> "[{\"adm2_code\":9961,\"ad…
## $ `GADM Admin Units`                          <chr> "[{\"gid_2\":\"BRA.19.68_2…
## $ `Entry Date`                                <date> 2018-02-20, 2003-07-01, 2…
## $ `Last Update`                               <date> 2025-12-20, 2025-12-20, 2…

The dataset provides historical disaster records across countries and years, enabling analysis of disaster frequency, severity, and impact patterns.

5 Data Preparation

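# Standardize column names: upper case, underscores for spaces, no periods
names(df) <- toupper(names(df))
names(df) <- str_replace_all(names(df), " ", "_")
names(df) <- str_replace_all(names(df), "\\.", "")

# Drop characters in country names that cannot be represented in ASCII
df$COUNTRY <- iconv(df$COUNTRY, from = "UTF-8", to = "ASCII", sub = "")

# Keep only the variables used in the analysis
df <- df %>%
  select(
    DISASTER_TYPE,
    COUNTRY,
    REGION,
    MAGNITUDE,
    TOTAL_DEATHS,
    TOTAL_AFFECTED,
    START_YEAR,
    START_MONTH,
    START_DAY
  )

# Drop records that lack the key grouping variables
df <- df %>%
  filter(
    !is.na(DISASTER_TYPE),
    !is.na(REGION),
    !is.na(START_YEAR)
  )

# Treat unreported deaths and affected counts as zero
df$TOTAL_DEATHS[is.na(df$TOTAL_DEATHS)] <- 0
df$TOTAL_AFFECTED[is.na(df$TOTAL_AFFECTED)] <- 0

# Impute missing magnitude with the overall median
df$MAGNITUDE[is.na(df$MAGNITUDE)] <- median(df$MAGNITUDE, na.rm = TRUE)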

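Preparation focused on selecting the relevant variables and handling missing values. Missing death and affected counts were replaced with zero. This keeps every record in the analysis, but it treats unreported impact as no impact and can therefore bias totals and averages downward. Missing magnitude values were replaced with the overall median. Because magnitude is recorded on different scales depending on the disaster type (see the Magnitude Scale column in the raw data), this pooled median is a rough placeholder rather than a physically meaningful value. Both choices should be kept in mind when interpreting the later results.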
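The effect of these imputation choices can be illustrated with a small sketch. The data frame below is hypothetical (the values and the lower-case column names are assumptions, not EM-DAT records). It compares the mean death count when missing values are skipped versus filled with zero, and shows one possible alternative for magnitude: imputing the median within each magnitude scale instead of across all records.

```r
# Toy records (hypothetical values, not taken from EM-DAT)
toy <- data.frame(
  type      = c("Flood", "Flood", "Flood", "Earthquake", "Earthquake"),
  scale     = c("Km2", "Km2", "Km2", "Richter", "Richter"),
  magnitude = c(500, NA, 1500, 6.0, NA),
  deaths    = c(4, NA, 10, 200, NA)
)

# Mean deaths when missing values are skipped vs. treated as zero
mean_skip_na <- mean(toy$deaths, na.rm = TRUE)
mean_zero    <- mean(ifelse(is.na(toy$deaths), 0, toy$deaths))

# Median of the observed magnitudes within each scale, repeated per row
scale_median <- ave(toy$magnitude, toy$scale,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill missing magnitudes with the median of their own scale
toy$magnitude_imputed <- ifelse(is.na(toy$magnitude),
                                scale_median, toy$magnitude)

print(c(skip_na = mean_skip_na, zero_filled = mean_zero))
print(toy)
```

In this toy example the zero-filled mean is lower than the mean over reported values, which is the direction of bias to expect whenever unreported counts are set to zero.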

6 Feature Engineering

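df <- df %>%
  mutate(
    # Default an unknown month or day to 1 so a start date can be built
    START_MONTH = ifelse(is.na(START_MONTH), 1, START_MONTH),
    START_DAY   = ifelse(is.na(START_DAY), 1, START_DAY),
    START_DATE  = make_date(START_YEAR, START_MONTH, START_DAY),
    # Start of the decade the event falls in, e.g. 2018 -> 2010
    DECADE      = floor(START_YEAR / 10) * 10,
    # log(1 + deaths) stays defined for zero-death events and reduces skew
    LOG_DEATHS  = log1p(TOTAL_DEATHS)
  )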

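New variables were created to support time-based analysis and modeling. Missing start months and days were set to 1 so that START_DATE could be constructed, which makes those dates approximate. The DECADE variable supports comparison of long-term trends. The log1p transformation, log(1 + deaths), reduces the skewness caused by extreme death counts while remaining defined for events with zero deaths.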
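As a quick check of how the decade and log features behave, the following sketch applies the same formulas to a few illustrative values (the numbers are made up for demonstration). It also shows that expm1() inverts log1p(), which is useful when translating model output back to death counts.

```r
# Illustrative values only
start_year <- c(2003, 2010, 2018, 2024)
deaths     <- c(0, 9, 99, 230000)

# Start of the decade each year falls in
decade <- floor(start_year / 10) * 10

# log(1 + deaths), defined even when deaths are zero
log_deaths <- log1p(deaths)

# expm1() inverts log1p(), recovering the original counts
recovered <- expm1(log_deaths)

print(data.frame(start_year, decade, deaths, log_deaths, recovered))
```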

7 Exploratory Data Analysis

7.1 Disaster Frequency

df %>%
  count(DISASTER_TYPE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(reorder(DISASTER_TYPE, n), n)) +
  geom_col(fill = "#2C7FB8") +
  coord_flip() +
  labs(
    title = "Most common disasters",
    x = "Disaster type",
    y = "Count"
  )

Floods are the most frequently recorded disasters, followed by storms and transport-related accidents. Hydrological and meteorological events therefore make up a large share of global disaster records. High frequency does not necessarily imply high severity, so frequency and impact need to be considered together when evaluating disaster risk.

7.2 Regional Distribution

df %>%
  count(REGION) %>%
  ggplot(aes(REGION, n)) +
  geom_col(fill = "#36454F") +
  labs(
    title = "Disasters by region",
    x = "Region",
    y = "Count"
  )

Asia has the highest number of recorded disasters, followed by Africa and the Americas. This regional variation may reflect differences in exposure, population distribution, and reporting coverage, so the counts should not be read as a direct measure of hazard.

7.3 Mortality Analysis

df %>%
  group_by(DISASTER_TYPE) %>%
  summarise(
    deaths = sum(TOTAL_DEATHS)
  ) %>%
  slice_max(deaths, n = 10) %>%
  ggplot(aes(reorder(DISASTER_TYPE, deaths), deaths)) +
  geom_col(fill = "#CD5C5C") +
  coord_flip() +
  labs(
    title = "Deaths by disaster type",
    x = "Disaster type",
    y = "Deaths"
  )

Earthquakes contribute the highest number of deaths despite their relatively lower frequency, which indicates that some disaster types carry a much higher mortality impact than others. Occurrence and severity are therefore distinct dimensions of risk, and both need to be evaluated.

7.4 Time Trend

df %>%
  count(START_YEAR) %>%
  ggplot(aes(START_YEAR, n)) +
  geom_line(color = "#1B9E77", linewidth = 1) +
  labs(
    title = "Disaster trend over time",
    x = "Year",
    y = "Count"
  )

The trend shows higher disaster counts in the early 2000s followed by a gradual stabilization in later years. The sharp drop in the most recent year likely reflects partial-year coverage and reporting lag (the data file is dated 2026-03-17) rather than a true decline. Overall, the data suggest relatively stable disaster occurrence rather than a strong increasing trend, although changes in reporting practices may also influence these patterns.
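One way to keep an incomplete final year from distorting the trend is to flag it and exclude it from comparisons. The sketch below uses a hypothetical set of event years and assumes, for illustration, that only the latest year is incomplete. It is not the procedure used for the plot above.

```r
# Toy event years (hypothetical); the last year is only partly observed
events <- data.frame(
  start_year = c(rep(2022, 5), rep(2023, 6), rep(2024, 5), rep(2025, 1))
)

# Count events per year
yearly <- as.data.frame(table(start_year = events$start_year),
                        responseName = "n", stringsAsFactors = FALSE)
yearly$start_year <- as.integer(yearly$start_year)

# Flag the latest year as incomplete and exclude it from the trend
last_year       <- max(yearly$start_year)
yearly$complete <- yearly$start_year < last_year
trend           <- yearly[yearly$complete, ]

print(trend)
```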

7.5 Countries with Highest Disaster Mortality

df %>%
  group_by(COUNTRY) %>%
  summarise(
    deaths = sum(TOTAL_DEATHS)
  ) %>%
  slice_max(deaths, n = 10) %>%
  ggplot(aes(reorder(COUNTRY, deaths), deaths)) +
  geom_col(fill = "#148F77") +
  coord_flip() +
  labs(
    title = "Countries with highest cumulative disaster deaths",
    x = "Country",
    y = "Total deaths"
  )

The results show that disaster mortality is concentrated among a few countries rather than evenly distributed globally. This likely reflects major catastrophic disasters occurring in specific regions, and it suggests that mortality is driven more by extreme events than by frequency alone. Monitoring high-impact disaster risks is therefore important.
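The idea that a few extreme events dominate mortality can be quantified as the share of all deaths contributed by the deadliest events. The following sketch uses made-up event-level death counts, and top_share() is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical event-level death counts
deaths <- c(0, 2, 5, 8, 12, 20, 35, 60, 150, 230000)

# Share of all deaths contributed by the k deadliest events
top_share <- function(x, k) {
  x <- sort(x, decreasing = TRUE)
  sum(head(x, k)) / sum(x)
}

share_top1 <- top_share(deaths, 1)
share_top3 <- top_share(deaths, 3)

print(round(c(top1 = share_top1, top3 = share_top3), 4))
```

Applied to TOTAL_DEATHS in the prepared data, a value close to 1 for a small k would support the interpretation that mortality is concentrated in a handful of events.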

7.6 Disaster Severity Comparison

df %>%
  group_by(DISASTER_TYPE) %>%
  summarise(
    avg_deaths = mean(TOTAL_DEATHS)
  ) %>%
  slice_max(avg_deaths, n = 10) %>%
  ggplot(aes(reorder(DISASTER_TYPE, avg_deaths), avg_deaths)) +
  geom_col(fill = "#7D6608") +
  coord_flip() +
  labs(
    title = "Average deaths per disaster",
    x = "Disaster type",
    y = "Average deaths"
  )

The results show that earthquakes have the highest average mortality per disaster event, followed by extreme temperature events and epidemics. Disaster impact therefore differs substantially across disaster types, and severity needs to be considered alongside occurrence. Because these averages include events whose missing death counts were set to zero and are sensitive to a few extreme events, they are best read as a rough ranking rather than a typical outcome.
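Because the mean is sensitive to a few extreme events, it can help to report the median next to it. The sketch below is a minimal illustration on hypothetical data (values and column names are assumptions) of how the two summaries can diverge within a disaster type.

```r
# Hypothetical event-level records
toy <- data.frame(
  type   = c("Flood", "Flood", "Flood", "Flood",
             "Earthquake", "Earthquake", "Earthquake"),
  deaths = c(2, 5, 8, 40, 10, 30, 20000)
)

# Mean and median deaths per disaster type
mean_deaths   <- tapply(toy$deaths, toy$type, mean)
median_deaths <- tapply(toy$deaths, toy$type, median)

severity <- data.frame(
  type   = names(mean_deaths),
  mean   = as.numeric(mean_deaths),
  median = as.numeric(median_deaths)
)

print(severity)
```

In the toy data a single catastrophic earthquake pulls the mean far above the median, which is the pattern to look for before interpreting average deaths per event.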

7.7 Population Exposure Trend

df %>%
  group_by(START_YEAR) %>%
  summarise(
    affected = sum(TOTAL_AFFECTED)
  ) %>%
  ggplot(aes(START_YEAR, affected)) +
  geom_line(color = "#2C3E50", linewidth = 1) +
  labs(
    title = "Total affected population over time",
    x = "Year",
    y = "Total affected"
  )

The total affected population varies widely across years, with a few extreme years dominating the series. Disaster exposure is therefore highly uneven over time and appears to be driven by major catastrophic events rather than by a consistent yearly pattern.

7.8 Regional Disaster Mortality Composition

df %>%
  group_by(REGION, DISASTER_TYPE) %>%
  summarise(
    deaths = sum(TOTAL_DEATHS),
    .groups = "drop"
  ) %>%
  group_by(REGION) %>%
  slice_max(deaths, n = 3) %>%
  ggplot(aes(REGION, deaths, fill = DISASTER_TYPE)) +
  geom_col() +
  labs(
    title = "Top disaster mortality contributors by region",
    x = "Region",
    y = "Total deaths",
    fill = "Disaster type"
  )

The results show disaster mortality varies across regions and disaster types. Earthquakes dominate mortality in Asia and the Americas, while extreme temperature events contribute substantially in Europe. Epidemics appear as major contributors in Africa. These results highlight regional differences in the types of disasters contributing most to mortality.

8 Statistical Analysis

8.1 Correlation Analysis

num_df <- df %>%
  select(
    TOTAL_DEATHS,
    TOTAL_AFFECTED,
    MAGNITUDE
  )

cor(num_df, use = "complete.obs")

##                 TOTAL_DEATHS TOTAL_AFFECTED     MAGNITUDE
## TOTAL_DEATHS    1.0000000000     0.03774357 -0.0003791226
## TOTAL_AFFECTED  0.0377435746     1.00000000  0.0236046283
## MAGNITUDE      -0.0003791226     0.02360463  1.0000000000

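The Pearson correlations between disaster magnitude, affected population, and mortality are all negligible (at most 0.04 in absolute value). These factors alone therefore show almost no linear association with mortality, which suggests that disaster outcomes are influenced by additional variables. Two caveats apply. Because missing values were imputed earlier, use = "complete.obs" does not drop any rows here, so the coefficients are computed on imputed data. In addition, Pearson correlation on heavily skewed counts is dominated by a few extreme events and can miss monotonic but non-linear relationships.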
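For skewed count data, a rank-based (Spearman) correlation or a Pearson correlation on the log scale can complement the raw Pearson coefficient. The sketch below uses simulated data with an assumed monotonic relationship. The variable names and distribution parameters are illustrative choices, not properties of the EM-DAT data.

```r
set.seed(42)

# Simulated skewed counts with a monotonic but noisy relationship
n        <- 500
affected <- round(rlnorm(n, meanlog = 8, sdlog = 2))
deaths   <- round(affected^0.3 * rlnorm(n, meanlog = 0, sdlog = 1))
sim      <- data.frame(deaths, affected)

# Pearson on raw counts, Pearson on the log scale, and Spearman on ranks
pearson_raw <- cor(sim$deaths, sim$affected, method = "pearson")
pearson_log <- cor(log1p(sim$deaths), log1p(sim$affected))
spearman    <- cor(sim$deaths, sim$affected, method = "spearman")

print(round(c(pearson_raw = pearson_raw,
              pearson_log = pearson_log,
              spearman    = spearman), 3))
```

Running the same three calls on TOTAL_DEATHS and TOTAL_AFFECTED would show whether the near-zero raw correlation reported above also holds on the rank and log scales.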

8.2 Correlation Heatmap

cor_df <- as.data.frame(as.table(cor(num_df, use = "complete.obs")))

ggplot(cor_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2))) +
  scale_fill_gradient2(
    low = "#2C7BB6",
    mid = "white",
    high = "#D7191C"
  ) +
  labs(title = "Correlation matrix")

The heatmap visualizes the same correlation matrix. Rounded to two decimals, the off-diagonal cells range from 0 to 0.04, which confirms that the linear relationships among deaths, magnitude, and affected population are weak.

8.3 Regression Analysis

model <- lm(
  LOG_DEATHS ~ MAGNITUDE + TOTAL_AFFECTED,
  data = df
)

summary(model)

## 
## Call:
## lm(formula = LOG_DEATHS ~ MAGNITUDE + TOTAL_AFFECTED, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4440 -1.2546  0.2119  0.9428  9.9141 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.353e+00  1.259e-02 186.956  < 2e-16 ***
## MAGNITUDE      7.570e-08  3.764e-08   2.011   0.0443 *  
## TOTAL_AFFECTED 1.240e-08  2.812e-09   4.408 1.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 16762 degrees of freedom
## Multiple R-squared:  0.001424,   Adjusted R-squared:  0.001305 
## F-statistic: 11.95 on 2 and 16762 DF,  p-value: 6.488e-06

The regression shows that magnitude (p = 0.044) and total affected population (p < 0.001) are statistically significant predictors of log deaths. With 16,765 observations, however, even very small effects reach statistical significance. The R-squared of 0.0014 means the model explains about 0.14% of the variance in log deaths, so the predictors have almost no practical explanatory power. Additional variables are needed to better explain disaster mortality.
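The size of these effects can be made concrete with a little arithmetic on the reported estimates. The sketch below copies the TOTAL_AFFECTED coefficient and the R-squared from the summary output and converts them to percentages. The choice of one million additional affected people is an arbitrary reference increment.

```r
# Values copied from the summary(model) output
beta_affected <- 1.240e-08
r_squared     <- 0.001424

# Reference increment in affected population (arbitrary choice)
delta <- 1e6

# Because the response is log(1 + deaths), exp(beta * delta) - 1 is the
# proportional change in (1 + deaths) for that increment
pct_change <- (exp(beta_affected * delta) - 1) * 100

# Share of variance in LOG_DEATHS explained by the model, in percent
pct_explained <- r_squared * 100

print(round(c(pct_change = pct_change, pct_explained = pct_explained), 3))
```

Under this model, one million additional affected people corresponds to an increase of only about 1.2% in (1 + deaths), which illustrates how small the estimated effect is despite its statistical significance.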

8.4 Exposure vs Mortality

ggplot(df, aes(log1p(TOTAL_AFFECTED), LOG_DEATHS)) +
  geom_point(
    color = "#2C7FB8",
    alpha = 0.5
  ) +
  geom_smooth(
    method = "lm",
    color = "#D95F0E"
  ) +
  labs(
    title = "Affected population vs mortality",
    x = "Log total affected population",
    y = "Log deaths"
  )

## `geom_smooth()` using formula = 'y ~ x'

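The plot shows a very weak relationship between affected population and mortality. The nearly flat regression line indicates limited association between exposure and deaths, and the wide spread of observations suggests that mortality depends on additional factors. Note that the plot uses the log-transformed affected population, whereas the regression in section 8.3 uses the untransformed value, so the fitted line here is not the same model. In both cases, exposure alone is not a strong predictor of disaster mortality.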
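A regression that matches the plot would use the log-transformed predictor as well. The sketch below fits such a log-log model on simulated data in which deaths and affected population are generated independently. The variable names and parameters are assumptions chosen only to keep the example self-contained.

```r
set.seed(1)

# Simulated, independently generated skewed counts
n              <- 300
total_affected <- round(rlnorm(n, meanlog = 8, sdlog = 2))
total_deaths   <- round(rlnorm(n, meanlog = 2, sdlog = 1.5))
sim            <- data.frame(total_affected, total_deaths)

# Log-log model: both response and predictor use log(1 + x)
fit_loglog <- lm(log1p(total_deaths) ~ log1p(total_affected), data = sim)

slope <- unname(coef(fit_loglog)[2])
r2    <- summary(fit_loglog)$r.squared

print(round(c(slope = slope, r_squared = r2), 4))
```

For large counts the slope of such a model reads approximately as an elasticity, that is, the percentage change in deaths associated with a one percent change in affected population.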

9 Decade Analysis

df %>%
  count(DECADE) %>%
  ggplot(aes(DECADE, n)) +
  geom_col(fill = "#1B9E77") +
  labs(title = "Disasters by decade")

The plot shows that disaster counts are highest in the 2000s and lower in later decades. The 2020s bar covers only part of a decade (the data file is dated 2026-03-17), so its lower count is expected and does not by itself indicate a decline. Recent trends should therefore be interpreted cautiously, and reporting differences across periods may also influence these patterns.
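To compare a partial decade with complete ones, counts can be expressed per observed year. The sketch below uses hypothetical event years with a constant four events per year, so that the only difference between decades is how many years are covered.

```r
# Hypothetical events: four per year, with the 2020s only partly observed
events <- data.frame(
  start_year = c(rep(2000:2009, each = 4),
                 rep(2010:2019, each = 4),
                 rep(2020:2025, each = 4))
)
events$decade <- floor(events$start_year / 10) * 10

# Events and distinct observed years per decade
n_events <- tapply(events$start_year, events$decade, length)
n_years  <- tapply(events$start_year, events$decade,
                   function(y) length(unique(y)))

per_decade <- data.frame(
  decade   = as.numeric(names(n_events)),
  n_events = as.numeric(n_events),
  n_years  = as.numeric(n_years)
)

# Annualized rate makes partial decades comparable
per_decade$events_per_year <- per_decade$n_events / per_decade$n_years

print(per_decade)
```

The raw count for the partial decade is lower, but the annualized rate is identical across decades in this toy example. Counting distinct years assumes every covered year has at least one event, which is a simplification.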

10 Key Findings

Major observations from this analysis:

Floods were the most common disasters in the dataset.
Earthquakes caused the highest number of deaths despite occurring less often.
Asia recorded the highest number of disasters among all regions.
Disaster impact varies by type, with earthquakes and extreme temperature events showing higher deaths per event.
Magnitude and affected population showed only weak relationships with mortality.
Most disaster deaths appear to come from a few major catastrophic events.
Different disaster types contribute to mortality differently across regions.
The number of affected people varies greatly by year, often driven by major disasters.
Mortality is likely influenced by factors beyond magnitude and exposure.
Trends across decades should be interpreted carefully due to possible reporting differences.

11 Conclusion

This analysis highlights an important reality about global disasters: the most frequent disasters are not always the most dangerous. While floods occur most often, earthquakes account for the greatest loss of life in this dataset. This contrast shows that disaster risk must be understood not only through frequency but also through impact.

The results also show that geography and population exposure may influence disaster outcomes. Regions with higher exposure appear to experience greater disaster burden. However, the variation in mortality across similar exposure levels suggests that additional factors such as preparedness, infrastructure, and response systems may also influence outcomes.

From a statistical perspective, the analysis shows disaster magnitude and affected population have statistically significant but weak associations with mortality. The low explanatory power of the model suggests disaster mortality depends on multiple factors beyond those included in this analysis.

Overall, this study demonstrates how data analysis can transform disaster records into meaningful insights. Such analysis can support researchers and disaster management agencies in understanding disaster patterns and identifying areas for further investigation. Data-driven approaches remain important for improving disaster preparedness and reducing human impact.

12 Future Scope

Future analysis could incorporate additional variables such as economic damage, disaster duration, infrastructure indicators, and preparedness metrics to better understand drivers of disaster mortality. Machine learning approaches may also help identify complex relationships not captured by simple regression models.
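As an example of one such variable, disaster duration could be derived from the start and end date columns that were dropped during preparation. The sketch below uses the start and end dates of the first and third records shown in the glimpse() output. The lower-case column names and the to_date() helper are hypothetical, and records with missing days would need separate handling.

```r
# Start and end dates of two records from the glimpse() output
toy <- data.frame(
  start_year = c(2018, 2022), start_month = c(2, 11), start_day = c(14, 17),
  end_year   = c(2018, 2022), end_month   = c(2, 11), end_day   = c(16, 18)
)

# Hypothetical helper: build a Date from year, month, and day columns
to_date <- function(y, m, d) {
  as.Date(sprintf("%04d-%02d-%02d", y, m, d))
}

start_date <- to_date(toy$start_year, toy$start_month, toy$start_day)
end_date   <- to_date(toy$end_year, toy$end_month, toy$end_day)

# Duration in days between start and end
toy$duration_days <- as.numeric(end_date - start_date)

print(toy[, c("start_year", "duration_days")])
```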