1 Synopsis

This report analyzes the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database (events recorded from 1950 through November 2011) to identify which types of severe weather events are most harmful to population health and which have the greatest economic consequences. To make event reporting more comparable across time, the primary analysis is restricted to 1996–2011, the period for which the National Weather Service recorded a standardized set of event types. Health harm is measured as the sum of fatalities and injuries; economic impact is the sum of property and crop damages after the damage exponent codes (e.g., K, M, B) are converted into dollar multipliers. Results show that a small set of hazards (e.g., tornadoes, excessive heat, flash floods, floods, thunderstorm wind, lightning) accounts for a large share of casualties, while floods, hurricanes/typhoons, storm surge, tornadoes, and hail dominate total damages. Damages are reported in nominal dollars, with no inflation adjustment. All code starts from the original compressed CSV file and is fully reproducible.

2 Data Processing

2.1 Getting the data

The raw dataset is a single compressed CSV (~47 MB) available from Coursera’s Reproducible Research course resources. If you already downloaded it, set data_path to the local file. Otherwise, the code below will attempt to download it.

# Choose your local file path, or leave as NULL to download.
data_path <- NULL
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (is.null(data_path)) {
  data_dir <- "data"
  if (!dir.exists(data_dir)) dir.create(data_dir)
  data_path <- file.path(data_dir, "StormData.csv.bz2")
  if (!file.exists(data_path)) {
    download.file(url, destfile = data_path, mode = "wb", quiet = TRUE)
  }
}

file.exists(data_path)

## [1] TRUE

2.2 Loading the raw CSV (no external preprocessing)

We load the compressed CSV directly using base R, and keep only the variables required for this analysis.

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
  library(ggplot2)
  library(scales)
  library(forcats)
  library(tidyr)
})

# read compressed csv.bz2 directly with base R
dt_raw <- read.csv(bzfile(data_path), stringsAsFactors = FALSE)

vars_needed <- c("EVTYPE","BGN_DATE","FATALITIES","INJURIES",
                 "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")

# standard data.frame subsetting
dt <- dt_raw[, vars_needed]

rm(dt_raw)  # free memory
str(dt)

## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
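
Because the full file has about 900,000 rows, it can be useful to peek at a few rows first and confirm that the required columns exist before committing to the full load. The following is a minimal sketch under stated assumptions: it writes a tiny made-up bz2 file to a temporary location as a stand-in for the real data, and the names tmp, peek, and missing_vars are illustrative, not part of the analysis above.

```r
# Toy stand-in for the real file: a tiny bz2-compressed CSV in a temp location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c(
  "EVTYPE,BGN_DATE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,REMARKS",
  "TORNADO,4/18/1950 0:00:00,0,15,25,K,0,,example row"
), con)
close(con)

vars_needed <- c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES",
                 "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Read at most five rows; bzfile() decompresses on the fly.
peek <- read.csv(bzfile(tmp), nrows = 5, stringsAsFactors = FALSE)

# Names in vars_needed that the file does not provide (should be empty).
missing_vars <- setdiff(vars_needed, names(peek))
if (length(missing_vars) > 0) {
  stop("Missing columns: ", paste(missing_vars, collapse = ", "))
}

unlink(tmp)  # remove the temporary file
```

To apply the same check to the real data, replace tmp with data_path and skip the lines that create and delete the toy file.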

2.3 Basic cleaning

Convert dates to Date and restrict to 1996–2011 (recommended due to standardized event catalog).
Normalize event type strings (trim/uppercase).
Convert damage exponents into multipliers and compute dollar damages.
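
The date conversion in the first step can be checked on a single value before it is applied to the full table. This is a small sketch using the first BGN_DATE value shown in the str() output; raw_date, parsed, yr, and bad are illustrative names.

```r
# Sample value copied from the BGN_DATE column shown in the str() output.
raw_date <- "4/18/1950 0:00:00"

# The format string is used for parsing; as.Date() keeps only the calendar date.
parsed <- as.Date(raw_date, format = "%m/%d/%Y %H:%M:%S")
yr <- as.integer(format(parsed, "%Y"))

# A value that does not match the format becomes NA rather than an error,
# so counting NAs after conversion is a cheap sanity check.
bad <- as.Date("1950-04-18", format = "%m/%d/%Y %H:%M:%S")
is.na(bad)
```

On the full data, sum(is.na(dt$BGN_DATE)) after the conversion would report how many dates failed to parse; rows with an NA year are dropped by the period filter.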

# 1) Dates and period filter
dt <- dt %>%
  mutate(
    BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
    year = as.integer(format(BGN_DATE, "%Y"))
  ) %>%
  filter(year >= 1996)   # primary analysis window

# 2) Normalize EVTYPE (string hygiene)
dt <- dt %>%
  mutate(EVTYPE = str_to_upper(str_trim(EVTYPE)))

# 3) Map exponent codes to multipliers
exp_map <- function(x) {
  # NOAA documentation commonly uses:
  # H/h=1e2, K/k=1e3, M/m=1e6, B/b=1e9
  # Numeric digits 0–8 sometimes appear and are interpreted as 10^digit.
  # Unrecognized or blank => multiplier 1
  x <- str_to_upper(str_trim(as.character(x)))
  out <- rep(1, length(x))
  out[x %in% c("H")] <- 1e2
  out[x %in% c("K")] <- 1e3
  out[x %in% c("M")] <- 1e6
  out[x %in% c("B")] <- 1e9
  # numeric digits
  is_digit <- str_detect(x, "^[0-8]$")
  out[is_digit] <- 10^(as.numeric(x[is_digit]))
  out
}

dt <- dt %>%
  mutate(
    prop_mult = exp_map(PROPDMGEXP),
    crop_mult = exp_map(CROPDMGEXP),
    prop_dmg_usd = PROPDMG * prop_mult,
    crop_dmg_usd = CROPDMG * crop_mult,
    econ_dmg_usd = prop_dmg_usd + crop_dmg_usd,
    casualties = FATALITIES + INJURIES
  )
summary(select(dt, FATALITIES, INJURIES, prop_dmg_usd, crop_dmg_usd))

##    FATALITIES           INJURIES          prop_dmg_usd      
##  Min.   :  0.00000   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:  0.00000   1st Qu.:0.000e+00   1st Qu.:0.000e+00  
##  Median :  0.00000   Median :0.000e+00   Median :0.000e+00  
##  Mean   :  0.01336   Mean   :8.871e-02   Mean   :5.612e+05  
##  3rd Qu.:  0.00000   3rd Qu.:0.000e+00   3rd Qu.:1.250e+03  
##  Max.   :158.00000   Max.   :1.150e+03   Max.   :1.150e+11  
##   crop_dmg_usd      
##  Min.   :0.000e+00  
##  1st Qu.:0.000e+00  
##  Median :0.000e+00  
##  Mean   :5.318e+04  
##  3rd Qu.:0.000e+00  
##  Max.   :1.510e+09
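
To make the exponent arithmetic concrete, the sketch below restates the same mapping rule in base R and applies it to a few made-up rows. It is an illustration only: exp_multiplier, letter_map, and toy are hypothetical names, and the damage values are invented.

```r
# Base-R restatement of the exponent rule, for illustration only.
exp_multiplier <- function(code) {
  code <- toupper(trimws(as.character(code)))
  mult <- rep(1, length(code))                  # blank or unrecognized => 1
  letter_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  is_letter <- code %in% names(letter_map)
  mult[is_letter] <- letter_map[code[is_letter]]
  is_digit <- grepl("^[0-8]$", code)            # digit d => 10^d
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

# Made-up rows covering an uppercase code, a lowercase code, B, and a blank.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)
toy$prop_dmg_usd <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

Here 25 with code K becomes 25,000 dollars, 2.5 with code m becomes 2.5 million, 1.2 with code B becomes 1.2 billion, and a blank code leaves the value 10 unchanged.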

2.3.1 Optional light harmonization of event types

Event type (EVTYPE) historically contains many near-duplicates. For this assignment, we apply light harmonization via simple pattern rules. This avoids heavy, manual reclassification but reduces obvious fragmentation (e.g., TSTM WIND → THUNDERSTORM WIND). You may refine these rules if you wish.

recode_evtype <- function(x) {
  x <- str_replace_all(x, "TSTM", "THUNDERSTORM")
  x <- str_replace_all(x, "THUNDERSTORM WINDS?", "THUNDERSTORM WIND")
  x <- str_replace_all(x, "^HEAT WAVE$", "EXCESSIVE HEAT")
  x <- str_replace_all(x, "RIP CURRENTS?", "RIP CURRENT")
  x <- str_replace_all(x, "FLASH FLOODING", "FLASH FLOOD")
  x <- str_replace_all(x, "FLOOD/FLASH FLOOD", "FLASH FLOOD")
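  # One anchored rule for both names. With two unanchored rules, the output of
  # the first ("HURRICANE/TYPHOON") contains "TYPHOON" and is rewritten again.
  x <- str_replace_all(x, "^(HURRICANE|TYPHOON).*", "HURRICANE/TYPHOON")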
  x <- str_replace_all(x, "STORM SURGE/TIDE", "STORM SURGE")
  x <- str_replace_all(x, "EXTREME COLD/WIND CHILL", "EXTREME COLD")
  x <- str_replace_all(x, "WILD/FOREST FIRE", "WILDFIRE")
  x <- str_replace_all(x, "WINTER WEATHER/MIX", "WINTER WEATHER")
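  # Anchored so that only the bare label matches; an unanchored "FOG" would
  # also rewrite labels that already read DENSE FOG or FREEZING FOG.
  x <- str_replace_all(x, "^FOG$", "DENSE FOG")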
  x
}

dt <- dt %>% mutate(EVTYPE = recode_evtype(EVTYPE))
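
Pattern rules of this kind are easy to get subtly wrong, because a replacement string can itself match a pattern. One inexpensive check is idempotency: applying the recoder a second time should change nothing. The sketch below uses base R and two hypothetical one-rule recoders (loose_rule and anchored_rule) purely to illustrate the check; not_idempotent and sample_labels are also illustrative names.

```r
# Returns the labels whose recoded form changes again on a second pass.
not_idempotent <- function(recode, labels) {
  once  <- recode(labels)
  twice <- recode(once)
  labels[once != twice]
}

# Hypothetical one-rule recoders, used only to illustrate the check.
loose_rule    <- function(x) gsub("FOG", "DENSE FOG", x)    # matches anywhere
anchored_rule <- function(x) sub("^FOG$", "DENSE FOG", x)   # whole label only

sample_labels <- c("FOG", "DENSE FOG", "FREEZING FOG")

not_idempotent(loose_rule, sample_labels)     # all three labels keep changing
not_idempotent(anchored_rule, sample_labels)  # character(0)
```

The same helper can be pointed at recode_evtype with unique(dt$EVTYPE) as the labels; any label it returns indicates a rule that needs an anchor or a narrower pattern.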

2.4 Aggregation for analysis

We compute totals by EVTYPE for health and economic impacts.

by_event <- dt %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries   = sum(INJURIES,   na.rm = TRUE),
    casualties = sum(casualties,  na.rm = TRUE),
    prop_usd   = sum(prop_dmg_usd, na.rm = TRUE),
    crop_usd   = sum(crop_dmg_usd, na.rm = TRUE),
    econ_usd   = sum(econ_dmg_usd, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(casualties))

head(by_event)
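
To see how concentrated the totals are, one option is to add each event type's share of the overall total and a running share down the ranking. The sketch below shows the idea in base R on a made-up table; toy, totals, share, and cum_share are illustrative names and the counts are invented.

```r
# Made-up per-record table, for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "TORNADO", "FLOOD"),
  casualties = c(30, 10, 15, 20, 25),
  stringsAsFactors = FALSE
)

# Sum casualties per event type, then sort from largest to smallest.
totals <- aggregate(casualties ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$casualties), ]

# Share of the overall total, and the running share down the ranking.
totals$share     <- totals$casualties / sum(totals$casualties)
totals$cum_share <- cumsum(totals$share)
totals
```

In this toy table the top event type holds half of all casualties and the top two hold 85 percent; the same two columns could be added to by_event with mutate().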

3 Results

3.1 Which event types are most harmful to population health?

We define population health harm as the combined total of fatalities + injuries. The figure below shows the top 10 event types by total casualties, with the bar split into injuries and fatalities to reveal composition.

topN <- 10
top_health <- by_event %>%
  slice_max(order_by = casualties, n = topN) %>%
  mutate(EVTYPE = fct_reorder(EVTYPE, casualties))

health_long <- top_health %>%
  select(EVTYPE, fatalities, injuries) %>%
  pivot_longer(cols = c(fatalities, injuries),
               names_to = "type", values_to = "count")

ggplot(health_long, aes(x = EVTYPE, y = count, fill = type)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(x = NULL, y = "People affected (count)",
       fill = "Casualty type",
       title = "Population Health Impact of Severe Weather (1996–2011)",
       subtitle = "Top 10 event types by fatalities + injuries") +
  theme_minimal(base_size = 11)

Top 10 event types by casualties (fatalities + injuries), 1996–2011. Each horizontal bar is stacked by casualty type (injuries and fatalities), as indicated by the fill legend.

Interpretation (concise): Tornadoes typically rank at or near the top for total casualties, with excessive heat also contributing a large share of fatalities. Thunderstorm wind, floods/flash floods, and lightning are also prominent contributors to overall harm.

3.2 Which event types have the greatest economic consequences?

We measure economic impact as the sum of property and crop damages after converting exponent codes into dollar amounts. The figure below shows the top 10 event types by total economic losses.

top_econ <- by_event %>%
  arrange(desc(econ_usd)) %>%
  slice_head(n = topN) %>%
  mutate(EVTYPE = fct_reorder(EVTYPE, econ_usd))

econ_long <- top_econ %>%
  select(EVTYPE, prop_usd, crop_usd) %>%
  pivot_longer(cols = c(prop_usd, crop_usd),
               names_to = "component", values_to = "usd")

ggplot(econ_long, aes(x = EVTYPE, y = usd, fill = component)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
  labs(x = NULL, y = "Damage (nominal USD, not inflation-adjusted)",
       fill = "Damage component",
       title = "Economic Impact of Severe Weather (1996–2011)",
       subtitle = "Top 10 event types by total damage") +
  theme_minimal(base_size = 11)

Top 10 event types by total economic losses (property + crop), 1996–2011.

Interpretation (concise): Flood-related events (riverine floods, flash floods) and tropical cyclones (hurricanes/typhoons, storm surge) typically dominate total losses, with tornadoes and hail also significant contributors. Crop damages are comparatively larger for drought and hail than for many wind hazards.

4 Sensitivity check (optional)

Results are similar, but more skewed toward tornadoes, if the full 1950–2011 period is used, because earlier decades under-report many non-tornado hazards. To reproduce that view, change filter(year >= 1996) to filter(year >= 1950) in the cleaning step and re-run everything from that step onward.
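
The effect of the analysis window can be illustrated on a made-up table. This is a sketch only: filter_period and tornado_share are hypothetical helpers, the rows are invented, and the end year of 2011 is taken from the period covered by the database.

```r
# Keep rows whose year falls in [start_year, end_year].
filter_period <- function(df, start_year, end_year = 2011) {
  df[df$year >= start_year & df$year <= end_year, , drop = FALSE]
}

# Made-up rows, for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "EXCESSIVE HEAT"),
  year       = c(1953, 1999, 2005, 1998),
  casualties = c(100, 40, 30, 50),
  stringsAsFactors = FALSE
)

full_window    <- filter_period(toy, 1950)
primary_window <- filter_period(toy, 1996)

# Tornado share of casualties under each window.
tornado_share <- function(df) {
  sum(df$casualties[df$EVTYPE == "TORNADO"]) / sum(df$casualties)
}
tornado_share(full_window)     # 140 / 220
tornado_share(primary_window)  # 40 / 120
```

In this toy example the tornado share falls from about 64 percent to about 33 percent when the window is narrowed, which mirrors the direction of the skew described above without saying anything about its actual size in the real data.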

5 Reproducibility notes

Starting point: All processing begins from the original StormData.csv.bz2 file; no external preprocessing was performed.
Transformations justified: We restricted to 1996–2011 to align with standardized event definitions; normalized event strings; and converted exponent codes (H/K/M/B and digits 0–8) into dollar multipliers.
Packages: dplyr, stringr, ggplot2, scales, forcats, and tidyr.
Figures: Two figures (each one plot) with descriptive captions; within the 3-figure limit.

6 Appendix: Session info

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.3.1   forcats_1.0.0 scales_1.4.0  ggplot2_3.5.2 stringr_1.5.1
## [6] dplyr_1.1.4  
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.1     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    yaml_2.3.10        fastmap_1.2.0      R6_2.6.1          
##  [9] labeling_0.4.3     generics_0.1.4     knitr_1.50         tibble_3.3.0      
## [13] bslib_0.9.0        pillar_1.11.0      RColorBrewer_1.1-3 rlang_1.1.6       
## [17] cachem_1.1.0       stringi_1.8.7      xfun_0.52          sass_0.4.10       
## [21] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     digest_0.6.37     
## [25] grid_4.5.1         rstudioapi_0.17.1  lifecycle_1.0.4    vctrs_0.6.5       
## [29] evaluate_1.0.4     glue_1.8.0         farver_2.1.2       codetools_0.2-20  
## [33] rmarkdown_2.29     purrr_1.1.0        tools_4.5.1        pkgconfig_2.0.3   
## [37] htmltools_0.5.8.1

7 Appendix: References

National Weather Service: Storm Data Documentation (event definitions).
National Climatic Data Center (NCDC): Storm Events FAQ.
NOAA Storm Database: StormData.csv.bz2 (accessed via Coursera course resources).

Which U.S. Severe Weather Events Harm People and Economies the Most? — An Exploration of NOAA Storm Data (1950–2011)