This report analyzes the U.S. NOAA Storm Database (events from 1950
through November 2011) to identify which event types are most
harmful to population health and which have the
greatest economic consequences.
We begin from the raw compressed CSV (.csv.bz2) and perform
all data processing within this document to ensure full
reproducibility.
For health impact, we consider both fatalities and
injuries.
For economic impact, we compute property and
crop damages after converting NOAA’s damage exponents
(e.g., K, M, B) to numeric
multipliers.
Because EVTYPE values are messy and inconsistent across
decades, we apply a transparent, pattern-based grouping into canonical
event categories (e.g., TORNADO, FLOOD,
THUNDERSTORM WIND, HURRICANE/TYPHOON).
We present two figures (top-10 health impact; top-10 economic damage)
with descriptive captions.
Earlier years in the database contain fewer recorded events; more recent
years are considered more complete.
This report is intended to support decision-makers who must prioritize
preparedness and mitigation resources.
TL;DR: After grouping, TORNADO causes by far the largest health impact (97,068 fatalities plus injuries), while FLOOD (about 161 billion USD) and HURRICANE/TYPHOON (about 139 billion USD) cause the greatest economic damage. Exact ranks and totals are shown in the figures and tables below.
We rely on data.table for fast reads, dplyr
for transformation, lubridate for dates,
stringr for text normalization, ggplot2 for figures, and
knitr for tables.
library(data.table)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(knitr)
We download the raw compressed CSV if it is not already
present.
> Source typically used in the Reproducible Research course.
if (!dir.exists("data")) dir.create("data", showWarnings = FALSE)
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "data/StormData.csv.bz2"
# Download only once; mode = "wb" keeps the binary archive intact on Windows
if (!file.exists(destfile)) {
  download.file(url, destfile, mode = "wb", quiet = TRUE)
}
file.info(destfile)[, c("size", "mtime")]
# fread() reads .bz2 files directly when the R.utils package is installed
storm_raw <- fread(destfile, stringsAsFactors = FALSE, showProgress = FALSE)
dim(storm_raw)
## [1] 902297 37
head(storm_raw[, .(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)])
Key variables used:
- EVTYPE: event type (text; very messy historically)
- FATALITIES, INJURIES: population health metrics
- PROPDMG, PROPDMGEXP: property damage and exponent code
- CROPDMG, CROPDMGEXP: crop damage and exponent code
- BGN_DATE: start date of the event
We parse dates, normalize EVTYPE to improve grouping,
and convert damage exponents to numeric multipliers.
Rules for damage exponents:
- K = 1,000; M = 1,000,000; B = 1,000,000,000; H = 100
- Empty/unknown/+/-/0 → multiplier = 1
- Digits 1–9 are treated as 10^digit (a common interpretation in this dataset)
- Any other unrecognized code defaults to multiplier = 1 (conservative)
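The rules above can be tried on a few sample codes. The following is a minimal base-R sketch, not the report's own function: the helper name `exp_multiplier` and the sample codes are hypothetical, and it follows the written rule strictly by treating only single digits as powers of ten.

```r
# Hypothetical helper: map NOAA damage exponent codes to numeric multipliers
# using a named lookup vector. Anything unrecognized falls back to 1.
exp_multiplier <- function(code) {
  code <- trimws(as.character(code))
  code[is.na(code)] <- ""                      # treat missing codes as empty
  code <- toupper(code)
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- rep(1, length(code))                 # default multiplier = 1
  is_letter <- code %in% names(lookup)
  mult[is_letter] <- unname(lookup[code[is_letter]])
  is_digit <- grepl("^[0-9]$", code)           # single digit d -> 10^d
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

codes <- c("K", "m", "B", "h", "", "+", "?", "0", "5", NA)
data.frame(code = codes, multiplier = exp_multiplier(codes))

# Worked example: PROPDMG = 2.5 with PROPDMGEXP = "M" is 2.5 million USD
2.5 * exp_multiplier("M")
```

Indexing with logical masks means `as.numeric()` is only applied to codes that are already known to be digits, so no coercion warnings are raised.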
storm_raw <- storm_raw %>%
  mutate(
    BGN_DATE = mdy_hms(BGN_DATE),
    EVTYPE = toupper(EVTYPE),
    EVTYPE = str_replace_all(EVTYPE, "[[:punct:]]", " "),
    EVTYPE = str_squish(EVTYPE)
  )
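To see what the normalization does to individual labels, here is a small base-R sketch of the same three steps (uppercase, punctuation to spaces, whitespace squeezed). The helper name `normalize_evtype` and the sample labels are illustrative assumptions, not taken from the dataset.

```r
# Hypothetical base-R equivalent of the EVTYPE normalization steps
normalize_evtype <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", " ", x)    # punctuation becomes a space
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # drop leading/trailing spaces
}

raw <- c("Tstm Wind/Hail", "  HURRICANE/TYPHOON ", "Flash Flood.", "heavy  snow")
normalize_evtype(raw)
```

Note that a slash-separated label such as HURRICANE/TYPHOON becomes two space-separated words after this step, which is why the grouping patterns match on words rather than on the original punctuation.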
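# Convert a damage exponent code (K, M, B, H, digit, ...) to a numeric multiplier
dmg_exp_to_num <- function(exp) {
  exp <- toupper(str_trim(as.character(exp)))
  exp[is.na(exp)] <- ""
  # ifelse() evaluates as.numeric() on every element, including letters,
  # so the coercion warnings for non-digit codes are suppressed here
  suppressWarnings(
    ifelse(exp %in% c("", "0", "+", "-"), 1,
    ifelse(exp == "H", 1e2,
    ifelse(exp == "K", 1e3,
    ifelse(exp == "M", 1e6,
    ifelse(exp == "B", 1e9,
    ifelse(grepl("^[0-9]+$", exp), 10^as.numeric(exp), 1))))))
  )
}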
storm_clean <- storm_raw %>%
  mutate(
    PROP_MULT = dmg_exp_to_num(PROPDMGEXP),
    CROP_MULT = dmg_exp_to_num(CROPDMGEXP),
    property_damage = as.numeric(PROPDMG) * PROP_MULT,
    crop_damage = as.numeric(CROPDMG) * CROP_MULT,
    total_damage = property_damage + crop_damage
  ) %>%
  select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
         property_damage, crop_damage, total_damage)
NOAA’s event type values contain many variants and typographical
differences.
We apply a pattern-based mapping to consolidate common classes.
This mapping is transparent and can be adjusted;
unmatched events retain their cleaned EVTYPE.
storm_clean <- storm_clean %>%
  mutate(event_group = dplyr::case_when(
    grepl("TORNADO", EVTYPE) ~ "TORNADO",
    grepl("HURRICANE|TYPHOON|STORM SURGE", EVTYPE) ~ "HURRICANE/TYPHOON",
    grepl("THUNDERSTORM WIND|TSTM WIND|\\bTSTM\\b|THUNDERSTORM", EVTYPE) ~ "THUNDERSTORM WIND",
    grepl("\\bHAIL\\b", EVTYPE) ~ "HAIL",
    grepl("FLASH FLOOD", EVTYPE) ~ "FLASH FLOOD",
    grepl("\\bFLOOD\\b|URBAN.*FLOOD", EVTYPE) ~ "FLOOD",
    grepl("\\bHEAT\\b|EXCESSIVE HEAT|RECORD HEAT|WARMTH|HYPERTHERMIA", EVTYPE) ~ "HEAT",
    grepl("DROUGHT|DRY", EVTYPE) ~ "DROUGHT",
    grepl("LIGHTNING", EVTYPE) ~ "LIGHTNING",
    grepl("BLIZZARD|WINTER|SNOW|ICE|FREEZ|SLEET", EVTYPE) ~ "WINTER WEATHER",
    grepl("WIND", EVTYPE) & !grepl("THUNDERSTORM", EVTYPE) ~ "WIND",
    TRUE ~ EVTYPE
  ))
length(unique(storm_clean$EVTYPE))
## [1] 837
length(unique(storm_clean$event_group))
## [1] 332
kable(head(sort(table(storm_clean$event_group), decreasing = TRUE)), col.names = c("event_group", "count"))
| event_group | count |
|---|---|
| THUNDERSTORM WIND | 336806 |
| HAIL | 289279 |
| TORNADO | 60700 |
| FLASH FLOOD | 55667 |
| WINTER WEATHER | 44029 |
| WIND | 28129 |
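Because the grouping is an ordered list of patterns, the first matching rule decides the group. The base-R sketch below illustrates that behaviour on a reduced, hypothetical rule list and a few made-up labels; `rules` and `group_events` are illustrative names, not part of the report's pipeline.

```r
# Hypothetical, reduced rule list: names are regex patterns, values are groups.
# Order matters: earlier rules take precedence over later ones.
rules <- c(
  "TORNADO"                 = "TORNADO",
  "THUNDERSTORM|\\bTSTM\\b" = "THUNDERSTORM WIND",
  "\\bHAIL\\b"              = "HAIL",
  "FLASH FLOOD"             = "FLASH FLOOD",
  "\\bFLOOD\\b"             = "FLOOD",
  "WIND"                    = "WIND"
)

group_events <- function(evtype, rules) {
  group <- evtype                          # unmatched labels keep their value
  assigned <- rep(FALSE, length(evtype))
  for (i in seq_along(rules)) {
    hit <- !assigned & grepl(names(rules)[i], evtype)
    group[hit] <- rules[[i]]
    assigned <- assigned | hit             # first match wins
  }
  group
}

labels <- c("TSTM WIND HAIL", "FLASH FLOOD", "FLOOD FLASH FLOOD",
            "HIGH WIND", "COASTAL FLOODING", "RIP CURRENT")
data.frame(EVTYPE = labels, event_group = group_events(labels, rules))
```

In this sketch a combined label like TSTM WIND HAIL lands in THUNDERSTORM WIND because that rule comes before HAIL, and COASTAL FLOODING stays ungrouped because the word-boundary pattern for FLOOD does not match FLOODING. Listing the unmatched labels in this way is one possible starting point for refining the mapping.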
We sum fatalities and injuries per event_group and
present the top 10.
health_summary <- storm_clean %>%
  group_by(event_group) %>%
  summarise(
    total_fatalities = sum(as.numeric(FATALITIES), na.rm = TRUE),
    total_injuries = sum(as.numeric(INJURIES), na.rm = TRUE),
    total_health = total_fatalities + total_injuries,
    .groups = "drop"
  ) %>%
  arrange(desc(total_health))
top10_health <- head(health_summary, 10)
ggplot(top10_health, aes(x = reorder(event_group, total_health), y = total_health)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Event Types by Human Impact",
    x = "Event Type (grouped)",
    y = "Fatalities + Injuries (total)"
  ) +
  theme_minimal()
Top 10 event groups by total population health impact (fatalities + injuries).
kable(top10_health, digits = 0, caption = "Table: Top-10 event groups by total health impact.")
| event_group | total_fatalities | total_injuries | total_health |
|---|---|---|---|
| TORNADO | 5661 | 91407 | 97068 |
| HEAT | 3139 | 9224 | 12363 |
| THUNDERSTORM WIND | 729 | 9544 | 10273 |
| FLOOD | 478 | 6793 | 7271 |
| WINTER WEATHER | 658 | 6052 | 6710 |
| LIGHTNING | 817 | 5231 | 6048 |
| FLASH FLOOD | 1035 | 1802 | 2837 |
| WIND | 690 | 1935 | 2625 |
| HURRICANE/TYPHOON | 159 | 1376 | 1535 |
| HAIL | 15 | 1371 | 1386 |
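Interpretation:
TORNADO dominates total health impact with 97,068 casualties, about 94% of
them injuries (91,407 injuries vs. 5,661 fatalities).
HEAT ranks second (12,363) but is proportionally far deadlier, with 3,139
fatalities against 9,224 injuries.
FLASH FLOOD has the highest fatality share in the top 10 (1,035 fatalities
vs. 1,802 injuries), while HAIL is almost entirely injuries (15 fatalities).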
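One way to check whether injuries or fatalities drive each total is to compute the fatality share per group. The sketch below copies the first seven rows of the health table above into a small data frame; the column `fatality_share` is an illustrative addition, not part of the report's output.

```r
# Values copied from the health-impact table (top seven rows)
health <- data.frame(
  event_group = c("TORNADO", "HEAT", "THUNDERSTORM WIND", "FLOOD",
                  "WINTER WEATHER", "LIGHTNING", "FLASH FLOOD"),
  total_fatalities = c(5661, 3139, 729, 478, 658, 817, 1035),
  total_injuries   = c(91407, 9224, 9544, 6793, 6052, 5231, 1802)
)

# Share of fatalities in the combined total (fatalities + injuries)
health$fatality_share <- round(
  health$total_fatalities / (health$total_fatalities + health$total_injuries), 3
)

# Sort so that the proportionally deadliest groups come first
health[order(-health$fatality_share), ]
```

With the full summary available, the same ratio could be added directly to `health_summary` instead of retyping the table values.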
We compute property + crop damages (USD) and present the top 10 (in billions).
econ_summary <- storm_clean %>%
  group_by(event_group) %>%
  summarise(
    total_property = sum(property_damage, na.rm = TRUE),
    total_crop = sum(crop_damage, na.rm = TRUE),
    total_damage = total_property + total_crop,
    .groups = "drop"
  ) %>%
  arrange(desc(total_damage))
top10_econ <- head(econ_summary, 10)
ggplot(top10_econ, aes(x = reorder(event_group, total_damage), y = total_damage / 1e9)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Event Types by Economic Damage",
    x = "Event Type (grouped)",
    y = "Total Damage (billion USD)"
  ) +
  theme_minimal()
Top 10 event groups by total economic damage (property + crop). Values shown in billions of USD.
kable(mutate(top10_econ,
             total_property = round(total_property / 1e9, 2),
             total_crop = round(total_crop / 1e9, 2),
             total_damage = round(total_damage / 1e9, 2)),
      col.names = c("event_group", "property (B)", "crop (B)", "total (B)"),
      caption = "Table: Top-10 event groups by economic damage (billions USD).")
| event_group | property (B) | crop (B) | total (B) |
|---|---|---|---|
| FLOOD | 150.21 | 10.81 | 161.01 |
| HURRICANE/TYPHOON | 133.32 | 5.52 | 138.84 |
| TORNADO | 58.60 | 0.42 | 59.02 |
| WINTER WEATHER | 12.47 | 7.21 | 19.68 |
| FLASH FLOOD | 17.59 | 1.53 | 19.12 |
| HAIL | 15.74 | 3.05 | 18.78 |
| DROUGHT | 1.05 | 13.97 | 15.03 |
| THUNDERSTORM WIND | 11.18 | 1.27 | 12.46 |
| TROPICAL STORM | 7.70 | 0.68 | 8.38 |
| WIND | 6.20 | 0.78 | 6.98 |
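Interpretation:
FLOOD (161.01 billion USD) and HURRICANE/TYPHOON (138.84 billion USD)
dominate economic losses, followed by TORNADO (59.02 billion USD); in all
three, property damage accounts for more than 90% of the total.
DROUGHT is the exception: 13.97 of its 15.03 billion USD is crop damage.
WINTER WEATHER also carries a sizeable crop component (7.21 of 19.68 billion USD).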
This analysis starts from the raw file (StormData.csv.bz2) and performs all cleaning and
transformations within this R Markdown document.
Event types are grouped by pattern matching on EVTYPE. The mapping is shown and can be refined; unmatched
events retain their cleaned label.
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## time zone: Asia/Calcutta
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.50 ggplot2_3.5.2 stringr_1.5.1 lubridate_1.9.4
## [5] dplyr_1.1.4 data.table_1.16.2
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.3 rlang_1.1.4 xfun_0.52
## [5] stringi_1.8.4 generics_0.1.4 jsonlite_1.8.8 labeling_0.4.3
## [9] glue_1.7.0 htmltools_0.5.8.1 sass_0.4.9 scales_1.4.0
## [13] rmarkdown_2.29 grid_4.4.1 evaluate_0.24.0 jquerylib_0.1.4
## [17] tibble_3.3.0 fastmap_1.2.0 yaml_2.3.9 lifecycle_1.0.4
## [21] compiler_4.4.1 RColorBrewer_1.1-3 timechange_0.3.0 pkgconfig_2.0.3
## [25] farver_2.1.2 digest_0.6.36 R6_2.5.1 tidyselect_1.2.1
## [29] pillar_1.11.0 magrittr_2.0.3 bslib_0.7.0 withr_3.0.0
## [33] gtable_0.3.6 tools_4.4.1 cachem_1.1.0