This report analyzes the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database (events recorded from 1950 through November 2011) to identify which types of severe weather events are most harmful to population health and which have the greatest economic consequences. To make event reporting more comparable across time, the primary analysis focuses on the period 1996–2011, when the National Weather Service standardized event types. Health harm is measured as the sum of fatalities and injuries; economic impact combines property and crop damages after converting damage exponents (e.g., K, M, B) into dollars. Results show that a small set of hazards (e.g., tornadoes, excessive heat, flash floods, floods, thunderstorm wind, lightning) account for a large share of casualties, while floods, hurricanes/typhoons, storm surge, tornadoes, and hail dominate total damages. All code starts from the original compressed CSV file and is fully reproducible.
The raw dataset is a single compressed CSV (~47 MB) available from Coursera’s Reproducible Research course resources. If you already downloaded it, set data_path to the local file. Otherwise, the code below will attempt to download it.
# Choose your local file path, or leave as NULL to download.
data_path <- NULL
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (is.null(data_path)) {
data_dir <- "data"
if (!dir.exists(data_dir)) dir.create(data_dir)
data_path <- file.path(data_dir, "StormData.csv.bz2")
if (!file.exists(data_path)) {
download.file(url, destfile = data_path, mode = "wb", quiet = TRUE)
}
}
file.exists(data_path)
## [1] TRUE
We load the compressed CSV directly using base R, and keep only the variables required for this analysis.
suppressPackageStartupMessages({
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(forcats)
library(tidyr)
})
# read compressed csv.bz2 directly with base R
dt_raw <- read.csv(bzfile(data_path), stringsAsFactors = FALSE)
vars_needed <- c("EVTYPE","BGN_DATE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
# standard data.frame subsetting
dt <- dt_raw[, vars_needed]
rm(dt_raw) # free memory
str(dt)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
We convert BGN_DATE to Date and restrict the data to 1996–2011 (recommended because the event catalog is standardized from 1996 onward).
# 1) Dates and period filter
dt <- dt %>%
mutate(
BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
year = as.integer(format(BGN_DATE, "%Y"))
) %>%
filter(year >= 1996) # primary analysis window
# 2) Normalize EVTYPE (string hygiene)
dt <- dt %>%
mutate(EVTYPE = str_to_upper(str_trim(EVTYPE)))
# 3) Map exponent codes to multipliers
exp_map <- function(x) {
# NOAA documentation commonly uses:
# H/h=1e2, K/k=1e3, M/m=1e6, B/b=1e9
# Numeric digits 0–8 sometimes appear and are interpreted as 10^digit.
# Unrecognized or blank => multiplier 1
x <- str_to_upper(str_trim(as.character(x)))
out <- rep(1, length(x))
out[x %in% c("H")] <- 1e2
out[x %in% c("K")] <- 1e3
out[x %in% c("M")] <- 1e6
out[x %in% c("B")] <- 1e9
# numeric digits
is_digit <- str_detect(x, "^[0-8]$")
out[is_digit] <- 10^(as.numeric(x[is_digit]))
out
}
dt <- dt %>%
mutate(
prop_mult = exp_map(PROPDMGEXP),
crop_mult = exp_map(CROPDMGEXP),
prop_dmg_usd = PROPDMG * prop_mult,
crop_dmg_usd = CROPDMG * crop_mult,
econ_dmg_usd = prop_dmg_usd + crop_dmg_usd,
casualties = FATALITIES + INJURIES
)
summary(select(dt, FATALITIES, INJURIES, prop_dmg_usd, crop_dmg_usd))
## FATALITIES INJURIES prop_dmg_usd
## Min. : 0.00000 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.00000 1st Qu.:0.000e+00 1st Qu.:0.000e+00
## Median : 0.00000 Median :0.000e+00 Median :0.000e+00
## Mean : 0.01336 Mean :8.871e-02 Mean :5.612e+05
## 3rd Qu.: 0.00000 3rd Qu.:0.000e+00 3rd Qu.:1.250e+03
## Max. :158.00000 Max. :1.150e+03 Max. :1.150e+11
## crop_dmg_usd
## Min. :0.000e+00
## 1st Qu.:0.000e+00
## Median :0.000e+00
## Mean :5.318e+04
## 3rd Qu.:0.000e+00
## Max. :1.510e+09
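As a quick sanity check on the exponent handling, the sketch below applies the same mapping rules (H/K/M/B, digits 0–8 as powers of ten, anything else as 1) to a handful of toy records. It is a base-R restatement for illustration, not the function used in the analysis; the name exp_to_multiplier and the toy values are assumptions and do not come from the database.

```r
# Hypothetical base-R restatement of the exponent rules described above.
exp_to_multiplier <- function(code) {
  code <- toupper(trimws(as.character(code)))
  mult <- rep(1, length(code))                  # blank or unrecognized => 1
  letter_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  is_letter <- code %in% names(letter_map)
  mult[is_letter] <- unname(letter_map[code[is_letter]])
  is_digit <- grepl("^[0-8]$", code)            # single digit => 10^digit
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

# Toy records (illustrative values only, not taken from the Storm Database)
toy <- data.frame(
  dmg = c(25, 2.5, 1.2, 3, 10),
  exp = c("K", "m", "B", "", "2"),
  stringsAsFactors = FALSE
)
toy$usd <- toy$dmg * exp_to_multiplier(toy$exp)
print(toy)
```

Checking a few rows by hand like this makes it easier to spot a mis-mapped code before it inflates or deflates the damage totals by orders of magnitude.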
Event type (EVTYPE) historically contains many near-duplicates. For this assignment, we apply light harmonization via simple pattern rules. This avoids heavy, manual reclassification but reduces obvious fragmentation (e.g., TSTM WIND → THUNDERSTORM WIND). You may refine these rules if you wish.
recode_evtype <- function(x) {
x <- str_replace_all(x, "TSTM", "THUNDERSTORM")
x <- str_replace_all(x, "THUNDERSTORM WINDS?", "THUNDERSTORM WIND")
x <- str_replace_all(x, "^HEAT WAVE$", "EXCESSIVE HEAT")
x <- str_replace_all(x, "RIP CURRENTS?", "RIP CURRENT")
x <- str_replace_all(x, "FLASH FLOODING", "FLASH FLOOD")
x <- str_replace_all(x, "FLOOD/FLASH FLOOD", "FLASH FLOOD")
x <- str_replace_all(x, "^HURRICANE.*", "HURRICANE/TYPHOON")
# anchored at the start so the TYPHOON inside "HURRICANE/TYPHOON" is not replaced a second time
x <- str_replace_all(x, "^TYPHOON.*", "HURRICANE/TYPHOON")
x <- str_replace_all(x, "STORM SURGE/TIDE", "STORM SURGE")
x <- str_replace_all(x, "EXTREME COLD/WIND CHILL", "EXTREME COLD")
x <- str_replace_all(x, "WILD/FOREST FIRE", "WILDFIRE")
x <- str_replace_all(x, "WINTER WEATHER/MIX", "WINTER WEATHER")
# exact match only, so "DENSE FOG" and "FREEZING FOG" are left unchanged
x <- str_replace_all(x, "^FOG$", "DENSE FOG")
x
}
dt <- dt %>% mutate(EVTYPE = recode_evtype(EVTYPE))
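Pattern rules of this kind are applied in sequence, so a later rule can accidentally rewrite the output of an earlier one. A minimal sketch of how to guard against that, assuming a reduced rule set in base R: the recoder should be idempotent, meaning that applying it twice gives the same labels as applying it once. The function name harmonize and the sample labels are hypothetical and chosen only for illustration.

```r
# Hypothetical, reduced rule set in base R; rules run in order.
harmonize <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("TSTM", "THUNDERSTORM", x, fixed = TRUE)
  x <- gsub("THUNDERSTORM WINDS?", "THUNDERSTORM WIND", x)
  # Anchored patterns: only labels that START with these words are collapsed
  x <- gsub("^(HURRICANE|TYPHOON).*", "HURRICANE/TYPHOON", x)
  # Exact match: other fog categories keep their own label
  x <- gsub("^FOG$", "DENSE FOG", x)
  x
}

# Illustrative labels, including ones that are already in canonical form
raw <- c("tstm wind", "THUNDERSTORM WINDS", "HURRICANE EDOUARD", "TYPHOON",
         "HURRICANE/TYPHOON", "FOG", "DENSE FOG", "FREEZING FOG")

once  <- harmonize(raw)
twice <- harmonize(once)

print(data.frame(raw = raw, recoded = once))
stopifnot(identical(once, twice))   # idempotence check
```

If the idempotence check fails, at least one rule is matching text that another rule produced, which usually means a pattern needs a start or end anchor.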
We compute totals by EVTYPE for health and economic impacts.
by_event <- dt %>%
group_by(EVTYPE) %>%
summarise(
fatalities = sum(FATALITIES, na.rm = TRUE),
injuries = sum(INJURIES, na.rm = TRUE),
casualties = sum(casualties, na.rm = TRUE),
prop_usd = sum(prop_dmg_usd, na.rm = TRUE),
crop_usd = sum(crop_dmg_usd, na.rm = TRUE),
econ_usd = sum(econ_dmg_usd, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(casualties))
head(by_event)
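The group-and-sum pattern used here can be verified on a table small enough to add up by hand. The sketch below uses base R's aggregate() on made-up records (the values are illustrative, not from the database) and confirms that total casualties equal fatalities plus injuries within each event type.

```r
# Toy records (made-up values for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(1, 0, 2, 0, 1),
  INJURIES   = c(10, 3, 5, 0, 0),
  stringsAsFactors = FALSE
)

# Sum both columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
totals$casualties <- totals$FATALITIES + totals$INJURIES

# Sort from most to least harmful
totals <- totals[order(-totals$casualties), ]
print(totals)
```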
We define population health harm as the combined total of fatalities + injuries. The figure below shows the top 10 event types by total casualties, with the bar split into injuries and fatalities to reveal composition.
topN <- 10
top_health <- by_event %>%
slice_max(order_by = casualties, n = topN) %>%
mutate(EVTYPE = fct_reorder(EVTYPE, casualties))
health_long <- top_health %>%
select(EVTYPE, fatalities, injuries) %>%
pivot_longer(cols = c(fatalities, injuries),
names_to = "type", values_to = "count")
ggplot(health_long, aes(x = EVTYPE, y = count, fill = type)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(x = NULL, y = "People affected (count)",
fill = "Casualty type",
title = "Population Health Impact of Severe Weather (1996–2011)",
subtitle = "Top 10 event types by fatalities + injuries") +
theme_minimal(base_size = 11)
Top 10 event types by casualties (fatalities + injuries), 1996–2011. Each horizontal bar is split into injuries and fatalities to show its composition.
Interpretation (concise): Tornadoes typically rank at or near the top for total casualties, with excessive heat also contributing a large share of fatalities. Thunderstorm wind, floods/flash floods, and lightning are also prominent contributors to overall harm.
We measure economic impact as the sum of property and crop damages after converting exponent codes into dollar amounts. The figure below shows the top 10 event types by total economic losses.
top_econ <- by_event %>%
arrange(desc(econ_usd)) %>%
slice_head(n = topN) %>%
mutate(EVTYPE = fct_reorder(EVTYPE, econ_usd))
econ_long <- top_econ %>%
select(EVTYPE, prop_usd, crop_usd) %>%
pivot_longer(cols = c(prop_usd, crop_usd),
names_to = "component", values_to = "usd")
ggplot(econ_long, aes(x = EVTYPE, y = usd, fill = component)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
labs(x = NULL, y = "Damage (nominal USD, not inflation-adjusted)",
fill = "Damage component",
title = "Economic Impact of Severe Weather (1996–2011)",
subtitle = "Top 10 event types by total damage") +
theme_minimal(base_size = 11)
Top 10 event types by total economic losses (property + crop), 1996–2011.
Interpretation (concise): Flood-related events (riverine floods, flash floods) and tropical cyclones (hurricanes/typhoons, storm surge) typically dominate total losses, with tornadoes and hail also significant contributors. Crop damages are comparatively larger for drought and hail than for many wind hazards.
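One way to make the comparison between crop and property damage concrete is to compute the crop share of total damage per event type. The sketch below assumes a table with the same column names as the aggregated totals (prop_usd, crop_usd); the event labels and dollar values are placeholders invented for illustration, not results from this analysis.

```r
# Placeholder totals (invented values; TYPE_A..TYPE_C are not real event types)
toy_totals <- data.frame(
  EVTYPE   = c("TYPE_A", "TYPE_B", "TYPE_C"),
  prop_usd = c(9e9, 1e9, 4e8),
  crop_usd = c(1e9, 3e9, 1e8),
  stringsAsFactors = FALSE
)

toy_totals$econ_usd <- toy_totals$prop_usd + toy_totals$crop_usd

# Crop share of total damage; NA where there is no recorded damage at all
toy_totals$crop_share <- ifelse(toy_totals$econ_usd > 0,
                                toy_totals$crop_usd / toy_totals$econ_usd,
                                NA_real_)

# Highest crop share first
toy_totals <- toy_totals[order(-toy_totals$crop_share), ]
print(toy_totals[, c("EVTYPE", "econ_usd", "crop_share")])
```

A ranking by share rather than by total highlights hazards whose losses fall mainly on agriculture even when their overall dollar total is modest.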
Results are similar, but more skewed toward tornadoes, if the full 1950–2011 period is used (because earlier decades under-report many non-tornado hazards). To reproduce that view, set filter(year >= 1950) in the cleaning step and re-run.
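A minimal sketch of how the analysis window could be made switchable through a single parameter, so both views can be compared side by side. The names start_year, filter_from, and tornado_share are hypothetical, and the toy records are made up purely to show the mechanics; they say nothing about the actual size of the effect.

```r
# Toy records (made-up values for illustration only)
toy <- data.frame(
  year       = c(1955, 1970, 1998, 2005, 2010),
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "TORNADO", "EXCESSIVE HEAT"),
  casualties = c(50, 30, 20, 10, 40),
  stringsAsFactors = FALSE
)

# Keep only records from start_year onward
filter_from <- function(df, start_year) {
  df[df$year >= start_year, , drop = FALSE]
}

# Share of all casualties attributed to tornadoes within a window
tornado_share <- function(df) {
  sum(df$casualties[df$EVTYPE == "TORNADO"]) / sum(df$casualties)
}

share_full   <- tornado_share(filter_from(toy, 1950))
share_recent <- tornado_share(filter_from(toy, 1996))

print(round(c(full_1950 = share_full, recent_1996 = share_recent), 3))
```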
All results are generated from the raw StormData.csv.bz2 file; no external preprocessing was performed.
Damage exponent codes (H/K/M/B and digits 0–8) are converted into dollar multipliers.
Packages used: dplyr, stringr, ggplot2, scales, forcats, and tidyr.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.3.1 forcats_1.0.0 scales_1.4.0 ggplot2_3.5.2 stringr_1.5.1
## [6] dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.1 tidyselect_1.2.1
## [5] jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0 R6_2.6.1
## [9] labeling_0.4.3 generics_0.1.4 knitr_1.50 tibble_3.3.0
## [13] bslib_0.9.0 pillar_1.11.0 RColorBrewer_1.1-3 rlang_1.1.6
## [17] cachem_1.1.0 stringi_1.8.7 xfun_0.52 sass_0.4.10
## [21] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 digest_0.6.37
## [25] grid_4.5.1 rstudioapi_0.17.1 lifecycle_1.0.4 vctrs_0.6.5
## [29] evaluate_1.0.4 glue_1.8.0 farver_2.1.2 codetools_0.2-20
## [33] rmarkdown_2.29 purrr_1.1.0 tools_4.5.1 pkgconfig_2.0.3
## [37] htmltools_0.5.8.1
NOAA Storm Database: StormData.csv.bz2 (accessed via Coursera course resources).