This report analyzes the U.S. National Oceanic and Atmospheric
Administration (NOAA) Storm Database (1950–2011) to identify which types
of severe weather events are most harmful to population health and which
have the greatest economic consequences. We start from the raw
compressed CSV (.csv.bz2) file, perform minimal and
transparent preprocessing, and compute event-level summaries for
fatalities, injuries, and economic damages (property + crops). Results
are reproducible: all code is shown and figures are generated from the
data.
Reproducibility note: Heavy steps are cached with cache=TRUE. Place
repdata_data_StormData.csv.bz2 next to this file, or let the code download it.
Data file: repdata_data_StormData.csv.bz2 (Coursera
prompt; originally NOAA Storm Database).
knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE, cache=TRUE, fig.align="center")
options(stringsAsFactors = FALSE)
# Minimal required packages
needed <- c("dplyr", "ggplot2")
to_install <- setdiff(needed, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, dependencies = TRUE)
library(dplyr)
library(ggplot2)
# Expected file in the same folder as this .Rmd
dest <- "repdata_data_StormData.csv.bz2"
# Auto-download if missing
if (!file.exists(dest)) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
}
if (!file.exists(dest)) stop("Data file not found: ", dest)
# Robust read with base R (no readr dependency)
storm <- read.csv(bzfile(dest))
# Keep only relevant columns (fail fast if any missing)
req_cols <- c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
missing <- setdiff(req_cols, names(storm))
if (length(missing)) stop("Missing columns in dataset: ", paste(missing, collapse=", "))
storm <- storm[, req_cols]
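Since only a handful of columns are used, the unused ones could in principle be skipped at read time instead of being dropped afterwards. The following is a minimal sketch of that idea using base R's `colClasses = "NULL"` convention; the temporary toy file, the `keep` vector, and the toy `REMARKS` column are illustrative assumptions, not part of the report's pipeline.

```r
# Build a tiny bz2-compressed CSV as a stand-in for the real data file
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(1, 0),
  REMARKS    = c("long text a", "long text b")
)
con <- bzfile(tmp, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Columns to keep (hypothetical subset for this toy example)
keep <- c("EVTYPE", "FATALITIES")

# Read only the header row, then mark unwanted columns as "NULL" so that
# read.csv() skips them; NA lets R infer the type of the kept columns
header  <- names(read.csv(bzfile(tmp), nrows = 1))
classes <- ifelse(header %in% keep, NA, "NULL")
small   <- read.csv(bzfile(tmp), colClasses = classes)

names(small)
```

With the full file this would mean passing the seven required column names as `keep`; whether it noticeably reduces memory use or read time on the real data has not been measured here.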
EVTYPE: normalized (uppercase + trimws) to reduce trivial variants.
Damage exponents (PROPDMGEXP, CROPDMGEXP): H/h=1e2, K/k=1e3, M/m=1e6, B/b=1e9; digits → 10^digit; others/missing → 1.
Losses: prop_loss = PROPDMG * multiplier, crop_loss = CROPDMG * multiplier, total_loss = prop_loss + crop_loss.
# 1) Normalize EVTYPE (base R)
storm$EVTYPE <- toupper(trimws(storm$EVTYPE))
# 2) Convert exponent codes to multipliers (base R)
exp_to_mult <- function(x) {
  letter_map <- c(H=1e2, K=1e3, M=1e6, B=1e9)
  vapply(x, function(v) {
    if (is.na(v) || v == "" || v %in% c("+","-","?")) return(1)
    vv <- as.character(v)
    if (vv %in% names(letter_map)) return(letter_map[[vv]])
    if (vv %in% tolower(names(letter_map))) return(letter_map[[toupper(vv)]])
    if (grepl("^[0-9]$", vv)) return(10^as.numeric(vv))
    1
  }, numeric(1))
}
prop_mult <- exp_to_mult(storm$PROPDMGEXP)
crop_mult <- exp_to_mult(storm$CROPDMGEXP)
storm$prop_loss <- storm$PROPDMG * prop_mult
storm$crop_loss <- storm$CROPDMG * crop_mult
storm$total_loss <- storm$prop_loss + storm$crop_loss
# 3) Event-level summaries (dplyr)
event_summ <- storm %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries = sum(INJURIES, na.rm = TRUE),
    prop_loss = sum(prop_loss, na.rm = TRUE),
    crop_loss = sum(crop_loss, na.rm = TRUE),
    total_loss = sum(total_loss, na.rm = TRUE),
    .groups = "drop"
  )
health_tbl <- event_summ %>%
  mutate(health_impact = fatalities + injuries) %>%
  arrange(desc(health_impact))
econ_tbl <- event_summ %>% arrange(desc(total_loss))
# Names for inline text if needed
top_health_name <- health_tbl$EVTYPE[1]
top_econ_name <- econ_tbl$EVTYPE[1]
# Quick peek
head(health_tbl, 5)
head(econ_tbl, 5)
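The exponent rules described above (letters to powers of ten, digits to 10^digit, everything else to 1) can also be written as a vectorized lookup, which avoids calling a function once per row. This is a sketch of an alternative, not the function used in this report; `exp_to_mult_vec` and the toy `codes` and `amount` vectors are hypothetical names introduced for illustration.

```r
# Hypothetical vectorized variant of the exponent lookup
exp_to_mult_vec <- function(x) {
  letter_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(x))
  mult <- rep(1, length(code))        # default: "", "+", "-", "?", NA, unknown codes
  is_letter <- code %in% names(letter_map)
  mult[is_letter] <- unname(letter_map[code[is_letter]])
  is_digit <- grepl("^[0-9]$", code)
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

# Toy inputs (illustrative values, not rows from the Storm Database)
codes  <- c("K", "m", "", "5", "?", NA)
amount <- c(25, 1.5, 10, 2, 3, 4)

# For example, 25 with code "K" becomes 25 * 1e3 = 25000
loss <- amount * exp_to_mult_vec(codes)
loss
```

Under the stated rules, both versions should return the same multipliers for the same input; the vectorized form is mainly a speed and memory convenience on a table of this size.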
We consider fatalities + injuries aggregated by event type.
top_n <- 10
top_health <- dplyr::slice_head(health_tbl, n = top_n)
ggplot(top_health, aes(x = reorder(EVTYPE, health_impact), y = health_impact)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste0("Top ", top_n, " Event Types by Health Impact (Fatalities + Injuries)"),
    x = "Event Type", y = "Total Fatalities + Injuries"
  ) +
  theme_minimal(base_size = 12)
Figure 1. Total fatalities + injuries by event type
(top 10).
The most harmful event type for population health is
TORNADO.
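Because health_impact adds fatalities and injuries with equal weight, the ranking can depend on that choice. The sketch below uses made-up numbers (not values from the Storm Database) and base R only, to show how a combined ranking and a fatalities-only ranking may disagree; the `toy` data frame and its column values are assumptions for illustration.

```r
# Toy per-event totals (illustrative numbers only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLOOD"),
  fatalities = c(5, 20, 2),
  injuries   = c(90, 10, 5)
)

# Same combined metric as in the report: fatalities + injuries
toy$health_impact <- toy$fatalities + toy$injuries

# Rank event types by the combined metric and by fatalities alone
by_combined   <- toy$EVTYPE[order(-toy$health_impact)]
by_fatalities <- toy$EVTYPE[order(-toy$fatalities)]

by_combined
by_fatalities
```

In this toy case the first-ranked event type differs between the two orderings, which is why the combined metric is stated explicitly alongside the result.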
We consider total_loss = property + crop damages in USD (nominal).
top_n <- 10
top_econ <- dplyr::slice_head(econ_tbl, n = top_n)
ggplot(top_econ, aes(x = reorder(EVTYPE, total_loss), y = total_loss / 1e9)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste0("Top ", top_n, " Event Types by Economic Losses"),
    x = "Event Type", y = "Total Losses (Billion USD)"
  ) +
  theme_minimal(base_size = 12)
Figure 2. Total economic damages (property + crops)
by event type (top 10).
The event type with the highest overall economic losses is
FLOOD.
The event type field (EVTYPE) contains many variants. We
applied minimal normalization (uppercase + trim) to stay close to the raw
data. A stricter recoding to official NOAA categories could reduce
fragmentation but would introduce subjectivity.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Santiago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.5.2 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
## [5] xfun_0.53 generics_0.1.4 jsonlite_2.0.0 labeling_0.4.3
## [9] glue_1.8.0 htmltools_0.5.8.1 sass_0.4.10 scales_1.4.0
## [13] rmarkdown_2.29 grid_4.5.1 evaluate_1.0.4 jquerylib_0.1.4
## [17] tibble_3.3.0 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
## [21] compiler_4.5.1 codetools_0.2-20 RColorBrewer_1.1-3 pkgconfig_2.0.3
## [25] rstudioapi_0.17.1 farver_2.1.2 digest_0.6.37 R6_2.6.1
## [29] tidyselect_1.2.1 pillar_1.11.0 magrittr_2.0.3 bslib_0.9.0
## [33] withr_3.0.2 gtable_0.3.6 tools_4.5.1 cachem_1.1.0
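As a follow-up to the note on EVTYPE fragmentation, a stricter recoding could be sketched as a small table of regular-expression rules applied after the uppercase + trim step. The rules below are hypothetical examples and do not reproduce NOAA's official category list; `rules`, `recode_evtype`, and the `raw` labels are names and values introduced only for illustration.

```r
# Hypothetical recoding rules: regex pattern -> canonical label
rules <- c(
  "^(TSTM|THUNDERSTORM) WINDS?$" = "THUNDERSTORM WIND",
  "^FLASH FLOOD(ING|S)?$"        = "FLASH FLOOD"
)

recode_evtype <- function(x, rules) {
  out <- toupper(trimws(x))           # same minimal normalization as the report
  for (i in seq_along(rules)) {
    hit <- grepl(names(rules)[i], out)
    out[hit] <- rules[[i]]            # collapse matching variants to one label
  }
  out
}

# Toy labels showing typical spelling and abbreviation variants
raw <- c("TSTM WIND", "Thunderstorm Winds ", "FLASH FLOODING", "HAIL")
recode_evtype(raw, rules)
```

Labels that match no rule pass through unchanged, so the rule table can be grown incrementally. Each added rule is a judgment call, which is the subjectivity mentioned above.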