Synopsis

This report analyzes the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Events Database to identify which event types are most harmful to population health and which have the greatest economic consequences across the United States. We start from the raw CSV repdata_data_StormData.csv, with no external pre-processing, and compute health impacts (fatalities and injuries) and standardized economic losses (property plus crop damage). All R code is shown (echo = TRUE), and heavy steps are cached for speed. The results show that a small number of event types account for a large share of adverse outcomes: tornadoes lead both fatalities and injuries, while floods, hurricanes/typhoons, tornadoes, and storm surges account for the largest economic losses. These findings can help public officials prioritize preparedness and mitigation resources.

Data Processing

Packages

# Expect these to be installed already. If not, run the install script first.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(stringr)
library(readr)
library(forcats)
library(tidyr)
library(knitr)

Load raw data

csv_path <- "repdata_data_StormData.csv"
stopifnot(file.exists(csv_path))

storm_raw <- readr::read_csv(
  csv_path,
  show_col_types = FALSE,
  progress = FALSE
)

storm <- storm_raw %>%
  dplyr::select(
    BGN_DATE, STATE, EVTYPE,
    FATALITIES, INJURIES,
    PROPDMG, PROPDMGEXP,
    CROPDMG, CROPDMGEXP
  )

rm(storm_raw)

Variable Description (Data Dictionary)

The following table describes the original variables used in this analysis and the derived fields we create. Descriptions follow NOAA documentation conventions and common usage of the Storm Events dataset.

var_dict <- tibble::tibble(
  Variable = c(
    "BGN_DATE","STATE","EVTYPE",
    "FATALITIES","INJURIES",
    "PROPDMG","PROPDMGEXP",
    "CROPDMG","CROPDMGEXP",
    "prop_mult","crop_mult",
    "prop_loss","crop_loss","total_loss"
  ),
  Type = c(
    "Date/Time","Factor/Character","Factor/Character",
    "Numeric","Numeric",
    "Numeric","Character",
    "Numeric","Character",
    "Numeric","Numeric",
    "Numeric (USD)","Numeric (USD)","Numeric (USD)"
  ),
  Description = c(
    "Date the event began.",
    "U.S. state/territory code.",
    "Event type label as recorded (e.g., TORNADO, FLOOD).",
    "Number of deaths directly/indirectly attributable to the event.",
    "Number of injuries directly/indirectly attributable to the event.",
    "Property damage base amount before exponent.",
    "Exponent for property damage: H=10^2, K=10^3, M=10^6, B=10^9, digits=10^digit, others treated as 1.",
    "Crop damage base amount before exponent.",
    "Exponent for crop damage with same convention as PROPDMGEXP.",
    "Multiplier derived from PROPDMGEXP.",
    "Multiplier derived from CROPDMGEXP.",
    "Standardized property damage in USD (PROPDMG * prop_mult).",
    "Standardized crop damage in USD (CROPDMG * crop_mult).",
    "Total economic loss = prop_loss + crop_loss in USD."
  )
)
kable(var_dict, caption = "Key variables and derived fields used in this analysis.")
Key variables and derived fields used in this analysis.

| Variable | Type | Description |
|---|---|---|
| BGN_DATE | Date/Time | Date the event began. |
| STATE | Factor/Character | U.S. state/territory code. |
| EVTYPE | Factor/Character | Event type label as recorded (e.g., TORNADO, FLOOD). |
| FATALITIES | Numeric | Number of deaths directly/indirectly attributable to the event. |
| INJURIES | Numeric | Number of injuries directly/indirectly attributable to the event. |
| PROPDMG | Numeric | Property damage base amount before exponent. |
| PROPDMGEXP | Character | Exponent for property damage: H=10^2, K=10^3, M=10^6, B=10^9, digits=10^digit, others treated as 1. |
| CROPDMG | Numeric | Crop damage base amount before exponent. |
| CROPDMGEXP | Character | Exponent for crop damage with same convention as PROPDMGEXP. |
| prop_mult | Numeric | Multiplier derived from PROPDMGEXP. |
| crop_mult | Numeric | Multiplier derived from CROPDMGEXP. |
| prop_loss | Numeric (USD) | Standardized property damage in USD (PROPDMG * prop_mult). |
| crop_loss | Numeric (USD) | Standardized crop damage in USD (CROPDMG * crop_mult). |
| total_loss | Numeric (USD) | Total economic loss = prop_loss + crop_loss in USD. |

Normalize event types (EVTYPE)

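For transparency, we apply only minimal standardization: convert labels to upper case, trim leading and trailing whitespace, and collapse repeated internal spaces. No synonym merging is performed, so each distinct label is ranked on its own.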

storm <- storm %>%
  mutate(EVTYPE = stringr::str_squish(stringr::str_to_upper(as.character(EVTYPE))))
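To illustrate what this step does and does not change, the following minimal sketch applies the same two stringr calls to a few sample labels. The sample vector is made up for illustration and is not taken from the data set.

```r
library(stringr)

# Hypothetical raw labels with inconsistent case and spacing
raw_labels <- c("  Tstm   Wind ", "TSTM WIND", "thunderstorm wind", "Flash  Flood")

# Same normalization as above: upper-case, then trim and collapse whitespace
clean_labels <- str_squish(str_to_upper(raw_labels))

print(clean_labels)
print(unique(clean_labels))
```

Case and spacing variants collapse into one label, but abbreviations do not. This is why TSTM WIND and THUNDERSTORM WIND appear as separate rows in the health table.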

Health outcomes

health_by_event <- storm %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries   = sum(INJURIES,   na.rm = TRUE),
    health_harm = fatalities + injuries,
    .groups = "drop"
  ) %>%
  filter(health_harm > 0)

top_health <- health_by_event %>%
  arrange(desc(health_harm)) %>%
  slice_head(n = 10)

knitr::kable(top_health, caption = "Top 10 event types by combined health harm (fatalities + injuries).")
Top 10 event types by combined health harm (fatalities + injuries).

| EVTYPE | fatalities | injuries | health_harm |
|---|---:|---:|---:|
| TORNADO | 5633 | 91346 | 96979 |
| EXCESSIVE HEAT | 1903 | 6525 | 8428 |
| TSTM WIND | 504 | 6957 | 7461 |
| FLOOD | 470 | 6789 | 7259 |
| LIGHTNING | 816 | 5230 | 6046 |
| HEAT | 937 | 2100 | 3037 |
| FLASH FLOOD | 978 | 1777 | 2755 |
| ICE STORM | 89 | 1975 | 2064 |
| THUNDERSTORM WIND | 133 | 1488 | 1621 |
| WINTER STORM | 206 | 1321 | 1527 |

Economic damages (standardization)

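NOAA stores each damage amount as a base value and a separate exponent code. We convert the exponent fields into numeric multipliers (H, K, M, or B, or a single digit 0-8 read as a power of ten; anything else, including blanks, is treated as 1) and multiply them by the base values to obtain standardized USD losses.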

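# Convert NOAA damage-exponent codes into numeric multipliers.
exp_to_multiplier <- function(x) {
  # Normalize case and whitespace so "k" and " K " are both read as "K"
  x <- toupper(trimws(as.character(x)))
  # Default multiplier of 1 covers blanks, NA, and unrecognized symbols
  m <- rep(1, length(x))
  m[x %in% c("H")] <- 1e2  # hundreds
  m[x %in% c("K")] <- 1e3  # thousands
  m[x %in% c("M")] <- 1e6  # millions
  m[x %in% c("B")] <- 1e9  # billions
  # A single digit 0-8 is read as a power of ten
  is_digit <- grepl("^[0-8]$", x)
  m[is_digit] <- 10 ^ as.numeric(x[is_digit])
  m
}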

storm <- storm %>%
  mutate(
    prop_mult = exp_to_multiplier(PROPDMGEXP),
    crop_mult = exp_to_multiplier(CROPDMGEXP),
    prop_loss = as.numeric(PROPDMG) * prop_mult,
    crop_loss = as.numeric(CROPDMG) * crop_mult,
    total_loss = prop_loss + crop_loss
  )
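As a worked example of the conversion, the sketch below runs a copy of the helper on a handful of records. The `demo` data frame and its values are hypothetical and chosen only to cover each branch (letter codes, lower case, a digit, a blank, and an unrecognized symbol).

```r
# Copy of the report's helper so this example runs on its own
exp_to_multiplier <- function(x) {
  x <- toupper(trimws(as.character(x)))
  m <- rep(1, length(x))
  m[x %in% c("H")] <- 1e2
  m[x %in% c("K")] <- 1e3
  m[x %in% c("M")] <- 1e6
  m[x %in% c("B")] <- 1e9
  is_digit <- grepl("^[0-8]$", x)
  m[is_digit] <- 10 ^ as.numeric(x[is_digit])
  m
}

# Hypothetical records: base amount plus exponent code
demo <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 3, 10, 10),
  PROPDMGEXP = c("K", "m", "B", "5", "", "+"),
  stringsAsFactors = FALSE
)

demo$prop_mult <- exp_to_multiplier(demo$PROPDMGEXP)
demo$prop_loss <- demo$PROPDMG * demo$prop_mult
print(demo)
```

Under this convention 25 with code K becomes 25,000 USD, while a base of 10 with a blank or "+" code stays at 10 USD. Treating unrecognized codes as 1 is a conservative choice that may understate some losses.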

econ_by_event <- storm %>%
  group_by(EVTYPE) %>%
  summarise(
    property_damage = sum(prop_loss, na.rm = TRUE),
    crop_damage     = sum(crop_loss, na.rm = TRUE),
    economic_loss   = sum(total_loss, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(economic_loss > 0)

top_econ <- econ_by_event %>%
  arrange(desc(economic_loss)) %>%
  slice_head(n = 10)

knitr::kable(top_econ, caption = "Top 10 event types by total economic loss (property + crop).")
Top 10 event types by total economic loss (property + crop).

| EVTYPE | property_damage | crop_damage | economic_loss |
|---|---:|---:|---:|
| FLOOD | 144657709807 | 5661968450 | 150319678257 |
| HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| TORNADO | 56947380677 | 414953270 | 57362333947 |
| STORM SURGE | 43323536000 | 5000 | 43323541000 |
| HAIL | 15735267513 | 3025954473 | 18761221986 |
| FLASH FLOOD | 16822723979 | 1421317100 | 18244041079 |
| DROUGHT | 1046106000 | 13972566000 | 15018672000 |
| HURRICANE | 11868319010 | 2741910000 | 14610229010 |
| RIVER FLOOD | 5118945500 | 5029459000 | 10148404500 |
| ICE STORM | 3944927860 | 5022113500 | 8967041360 |
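The raw USD totals are hard to compare at a glance. As a reading aid, this short sketch re-expresses the four largest totals in billions of USD; the figures are copied from the table, and the `econ_top` and `loss_billion` names are introduced here only for illustration.

```r
# Totals copied from the top four rows of the economic-loss table (USD)
econ_top <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE"),
  economic_loss = c(150319678257, 71913712800, 57362333947, 43323541000)
)

# Express in billions of USD, rounded to one decimal place
econ_top$loss_billion <- round(econ_top$economic_loss / 1e9, 1)
print(econ_top)
```

Rounded this way, the four totals are about 150.3, 71.9, 57.4, and 43.3 billion USD.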

Results

Which event types are most harmful to population health?

health_long <- health_by_event %>%
  semi_join(top_health, by = "EVTYPE") %>%
  select(EVTYPE, fatalities, injuries) %>%
  tidyr::pivot_longer(cols = c(fatalities, injuries),
                      names_to = "metric", values_to = "count") %>%
  mutate(EVTYPE = forcats::fct_reorder(EVTYPE, count, sum))

ggplot(health_long, aes(x = EVTYPE, y = count, fill = metric)) +
  geom_col() +
  coord_flip() +
  labs(
    x = "Event Type (EVTYPE)",
    y = "People Affected",
    fill = "Health Metric",
    title = "Top 10 Event Types by Health Harm (Fatalities + Injuries)"
  ) +
  theme_minimal(base_size = 12)

Which event types have the greatest economic consequences?

econ_long <- econ_by_event %>%
  semi_join(top_econ, by = "EVTYPE") %>%
  select(EVTYPE, property_damage, crop_damage) %>%
  tidyr::pivot_longer(cols = c(property_damage, crop_damage),
                      names_to = "type", values_to = "usd") %>%
  mutate(EVTYPE = forcats::fct_reorder(EVTYPE, usd, sum))

ggplot(econ_long, aes(x = EVTYPE, y = usd/1e9, fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    x = "Event Type (EVTYPE)",
    y = "Economic Loss (Billions of USD)",
    fill = "Damage Type",
    title = "Top 10 Event Types by Economic Loss (Property + Crop)"
  ) +
  theme_minimal(base_size = 12)

Analysis of Results

Below we provide a brief, decision-oriented interpretation of the results. The analysis automatically references the computed top categories.

# Extract top leaders for dynamic text
top_health_leader <- top_health$EVTYPE[1]
top_econ_leader   <- top_econ$EVTYPE[1]
health_share_top3 <- health_by_event %>%
  arrange(desc(health_harm)) %>%
  slice_head(n = 3) %>%
  summarise(share = sum(health_harm) / sum(health_by_event$health_harm)) %>%
  pull(share)

econ_share_top3 <- econ_by_event %>%
  arrange(desc(economic_loss)) %>%
  slice_head(n = 3) %>%
  summarise(share = sum(economic_loss) / sum(econ_by_event$economic_loss)) %>%
  pull(share)

cat(sprintf("**Health impacts.** `%s` is the single most harmful event type when combining fatalities and injuries. The top three event types account for roughly %.1f%% of total health harm, indicating a highly skewed risk distribution.\n\n",
            top_health_leader, 100*health_share_top3))

Health impacts. TORNADO is the single most harmful event type when combining fatalities and injuries. The top three event types account for roughly 72.5% of total health harm, indicating a highly skewed risk distribution.

cat(sprintf("**Economic impacts.** `%s` yields the largest total economic losses (property + crop). Similarly, the top three event types account for about %.1f%% of total losses, underscoring the value of targeting mitigation resources to a small set of high-impact hazards.\n\n",
            top_econ_leader, 100*econ_share_top3))

Economic impacts. FLOOD yields the largest total economic losses (property + crop). Similarly, the top three event types account for about 58.6% of total losses, underscoring the value of targeting mitigation resources to a small set of high-impact hazards.
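The top-three shares quoted above are the sum of the three largest per-event totals divided by the sum over all event types. A minimal base R sketch of that calculation follows; `top_k_share` is a hypothetical helper (the report itself uses dplyr pipelines), and the `harm` vector is invented toy data.

```r
# Hypothetical helper: share of the total held by the k largest values
top_k_share <- function(values, k = 3) {
  values <- sort(values, decreasing = TRUE)
  sum(head(values, k)) / sum(values)
}

# Toy per-event totals (not taken from the data set)
harm <- c(A = 60, B = 20, C = 10, D = 5, E = 5)
share <- top_k_share(harm, k = 3)
print(share)

# Numerator of the health share, using the three largest health_harm
# values from the health table (TORNADO, EXCESSIVE HEAT, TSTM WIND)
top3_health_sum <- sum(c(96979, 8428, 7461))
print(top3_health_sum)
```

The denominator for the real shares is the total over every event type with non-zero harm, which the top-10 tables do not show, so the percentages cannot be reproduced from the tables alone.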

# Additional insights: injury vs fatality composition, property vs crop composition
fatality_leader <- health_by_event %>% arrange(desc(fatalities)) %>% slice(1) %>% pull(EVTYPE)
injury_leader   <- health_by_event %>% arrange(desc(injuries))   %>% slice(1) %>% pull(EVTYPE)
prop_leader     <- econ_by_event    %>% arrange(desc(property_damage)) %>% slice(1) %>% pull(EVTYPE)
crop_leader     <- econ_by_event    %>% arrange(desc(crop_damage))     %>% slice(1) %>% pull(EVTYPE)

cat(sprintf("**Composition.** Fatalities are led by `%s`, while injuries are led by `%s`. On the economic side, property damages are highest for `%s`, and crop damages are dominated by `%s`.\n\n",
            fatality_leader, injury_leader, prop_leader, crop_leader))

Composition. Fatalities are led by TORNADO, while injuries are led by TORNADO. On the economic side, property damages are highest for FLOOD, and crop damages are dominated by DROUGHT.

cat("**Policy takeaway.** A risk-informed preparedness plan should emphasize the leading health-impact hazards (e.g., warning systems, shelter access) and the leading economic-impact hazards (e.g., flood and wind-resistant infrastructure, surge barriers, crop protection and drought management). Geographic tailoring matters, as leading hazards vary by state and season.\n")

Policy takeaway. A risk-informed preparedness plan should emphasize the leading health-impact hazards (e.g., warning systems, shelter access) and the leading economic-impact hazards (e.g., flood and wind-resistant infrastructure, surge barriers, crop protection and drought management). Geographic tailoring matters, as leading hazards vary by state and season.

Conclusions

Reproducibility Notes

Appendix: Session Info

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Spanish_Mexico.utf8  LC_CTYPE=Spanish_Mexico.utf8   
## [3] LC_MONETARY=Spanish_Mexico.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=Spanish_Mexico.utf8    
## 
## time zone: America/Guayaquil
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.48    tidyr_1.3.1   forcats_1.0.0 readr_2.1.5   stringr_1.5.1
## [6] ggplot2_3.5.1 dplyr_1.1.4  
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.0.5         gtable_0.3.5      jsonlite_1.8.8    highr_0.11       
##  [5] crayon_1.5.3      compiler_4.4.1    tidyselect_1.2.1  parallel_4.4.1   
##  [9] jquerylib_0.1.4   scales_1.3.0      yaml_2.3.9        fastmap_1.2.0    
## [13] R6_2.5.1          labeling_0.4.3    generics_0.1.3    tibble_3.2.1     
## [17] munsell_0.5.1     bslib_0.7.0       pillar_1.9.0      tzdb_0.4.0       
## [21] rlang_1.1.4       utf8_1.2.4        cachem_1.1.0      stringi_1.8.4    
## [25] xfun_0.45         sass_0.4.9        bit64_4.0.5       cli_3.6.3        
## [29] withr_3.0.0       magrittr_2.0.3    digest_0.6.36     grid_4.4.1       
## [33] vroom_1.6.5       rstudioapi_0.16.0 hms_1.1.3         lifecycle_1.0.4  
## [37] vctrs_0.6.5       evaluate_0.24.0   glue_1.7.0        farver_2.1.2     
## [41] codetools_0.2-20  fansi_1.0.6       colorspace_2.1-0  purrr_1.0.2      
## [45] rmarkdown_2.27    tools_4.4.1       pkgconfig_2.0.3   htmltools_0.5.8.1