1 Synopsis

This report analyzes the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database (1950–2011) to identify which types of severe weather events are most harmful to population health and which have the greatest economic consequences. We start from the raw compressed CSV (.csv.bz2) file, perform minimal and transparent preprocessing, and compute event-level summaries for fatalities, injuries, and economic damages (property + crops). Results are reproducible: all code is shown and figures are generated from the data.

Reproducibility note: heavy steps are cached with cache=TRUE. Place repdata_data_StormData.csv.bz2 next to this file, or let the code download it on first run.

2 Data Processing

2.1 Data source

  • Raw file: repdata_data_StormData.csv.bz2 (distributed with the Coursera assignment; originally from the NOAA Storm Database).
  • Coverage: 1950 to November 2011. Early years have fewer recorded events and are under-reported; recent years are more complete.

knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE, cache=TRUE, fig.align="center")
options(stringsAsFactors = FALSE)

# Minimal required packages
needed <- c("dplyr", "ggplot2")
to_install <- setdiff(needed, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, dependencies = TRUE)

library(dplyr)
library(ggplot2)

# Expected file in the same folder as this .Rmd
dest <- "repdata_data_StormData.csv.bz2"

# Auto-download if missing
if (!file.exists(dest)) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
}

if (!file.exists(dest)) stop("Data file not found: ", dest)

# Robust read with base R (no readr dependency)
storm <- read.csv(bzfile(dest))

# Keep only relevant columns (fail fast if any missing)
req_cols <- c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
missing <- setdiff(req_cols, names(storm))
if (length(missing)) stop("Missing columns in dataset: ", paste(missing, collapse=", "))
storm <- storm[, req_cols]
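
The read and the fail-fast column check can be tried without the full download. The sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back the same way. The toy rows and the helper name check_cols are invented for illustration and are not part of the analysis.

```r
# Toy stand-in for the Storm Data file (invented rows, subset of columns)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(1, 0),
  PROPDMG    = c(2.5, 10),
  PROPDMGEXP = c("K", "M"),
  stringsAsFactors = FALSE
)

# Write it as a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens and closes the bzfile() connection itself
back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)

# Hypothetical helper: names of required columns that are absent
check_cols <- function(df, required) setdiff(required, names(df))

missing_ok  <- check_cols(back, c("EVTYPE", "FATALITIES"))
missing_bad <- check_cols(back, c("EVTYPE", "INJURIES", "CROPDMG"))
```

Here missing_ok is empty, while missing_bad lists INJURIES and CROPDMG, which is the situation in which the stop() call in the chunk above would halt the analysis.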

2.2 Transformations and justification

  1. Normalize event types: convert to uppercase and trim surrounding whitespace (trimws) to merge trivial variants.
  2. Damage exponents: map H/h → 1e2, K/k → 1e3, M/m → 1e6, B/b → 1e9; a single digit d → 10^d; missing or any other code → 1.
  3. Totals: prop_loss = PROPDMG × (PROPDMGEXP multiplier), crop_loss = CROPDMG × (CROPDMGEXP multiplier), total_loss = prop_loss + crop_loss.
  4. Summaries: sum fatalities, injuries, and losses by event type.

# 1) Normalize EVTYPE (base R)
storm$EVTYPE <- toupper(trimws(storm$EVTYPE))

# 2) Convert exponent codes to multipliers (base R)
exp_to_mult <- function(x) {
  letter_map <- c(H=1e2, K=1e3, M=1e6, B=1e9)
  vapply(x, function(v) {
    # Missing, empty, and symbol codes carry no magnitude information
    if (is.na(v) || v == "" || v %in% c("+","-","?")) return(1)
    vv <- toupper(as.character(v))   # treat h/k/m/b like H/K/M/B
    if (vv %in% names(letter_map)) return(letter_map[[vv]])
    if (grepl("^[0-9]$", vv)) return(10^as.numeric(vv))
    1                                # any other code: leave the value unscaled
  }, numeric(1), USE.NAMES = FALSE)  # return a plain, unnamed numeric vector
}

prop_mult <- exp_to_mult(storm$PROPDMGEXP)
crop_mult <- exp_to_mult(storm$CROPDMGEXP)

# 3) Totals in nominal USD (base R)
storm$prop_loss  <- storm$PROPDMG * prop_mult
storm$crop_loss  <- storm$CROPDMG * crop_mult
storm$total_loss <- storm$prop_loss + storm$crop_loss

# 4) Event-level summaries (dplyr)
event_summ <- storm %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries   = sum(INJURIES,   na.rm = TRUE),
    prop_loss  = sum(prop_loss,  na.rm = TRUE),
    crop_loss  = sum(crop_loss,  na.rm = TRUE),
    total_loss = sum(total_loss, na.rm = TRUE),
    .groups = "drop"
  )

health_tbl <- event_summ %>%
  mutate(health_impact = fatalities + injuries) %>%
  arrange(desc(health_impact))

econ_tbl <- event_summ %>% arrange(desc(total_loss))

# Names for inline text if needed
top_health_name <- health_tbl$EVTYPE[1]
top_econ_name   <- econ_tbl$EVTYPE[1]

# Quick peek
head(health_tbl, 5)
head(econ_tbl, 5)
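
To make the exponent rules of Section 2.2 concrete, here is a small worked example on invented rows. The function name exp_to_mult_vec is hypothetical: it is a vectorized sketch that follows the same mapping as the text (letters, single digits, everything else → 1), not the function used in the analysis above.

```r
# Hypothetical vectorized variant of the exponent mapping described in 2.2
exp_to_mult_vec <- function(x) {
  x <- toupper(as.character(x))
  mult <- rep(1, length(x))                       # default: missing/other -> 1
  letter_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  is_letter <- !is.na(x) & x %in% names(letter_map)
  mult[is_letter] <- unname(letter_map[x[is_letter]])
  is_digit <- !is.na(x) & grepl("^[0-9]$", x)     # a single digit d -> 10^d
  mult[is_digit] <- 10^as.numeric(x[is_digit])
  mult
}

# Invented rows covering each rule
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", "2", "?"),
  CROPDMG    = c(0, 10, 0, 0, 5),
  CROPDMGEXP = c("", "K", NA, "+", "h"),
  stringsAsFactors = FALSE
)

toy$prop_loss  <- toy$PROPDMG * exp_to_mult_vec(toy$PROPDMGEXP)
toy$crop_loss  <- toy$CROPDMG * exp_to_mult_vec(toy$CROPDMGEXP)
toy$total_loss <- toy$prop_loss + toy$crop_loss
```

For these rows total_loss comes out as 25000, 2510000, 1e9, 300, and 507 USD: the "?" and "+" codes leave the raw value unscaled, and lowercase letters behave like their uppercase forms.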

3 Results

3.1 Which events are most harmful to population health?

We measure health impact as fatalities + injuries, summed by event type. This counts a fatality and an injury equally.

top_n <- 10
top_health <- dplyr::slice_head(health_tbl, n = top_n)

ggplot(top_health, aes(x = reorder(EVTYPE, health_impact), y = health_impact)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste0("Top ", top_n, " Event Types by Health Impact (Fatalities + Injuries)"),
    x = "Event Type", y = "Total Fatalities + Injuries"
  ) +
  theme_minimal(base_size = 12)

Figure 1. Total fatalities + injuries by event type (top 10).
The most harmful event type for population health is TORNADO.
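
Because the combined metric adds fatalities and injuries with equal weight, the ranking can differ from a fatalities-only ranking. The base R sketch below illustrates this on invented counts (they are not taken from the Storm Database) and also shows the uppercase + trim normalization merging trivially different labels.

```r
# Invented counts; EVTYPE labels deliberately differ in case and whitespace
toy <- data.frame(
  EVTYPE     = c(" tornado", "TORNADO", "Flood", "HEAT ", "heat"),
  FATALITIES = c(2, 1, 4, 5, 3),
  INJURIES   = c(30, 20, 1, 0, 2),
  stringsAsFactors = FALSE
)

# Same normalization as in Section 2.2
toy$EVTYPE <- toupper(trimws(toy$EVTYPE))

# Sum both measures by event type, then build the combined metric
summ <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
summ$health_impact <- summ$FATALITIES + summ$INJURIES

# Two rankings: combined metric versus fatalities only
by_combined <- as.character(summ$EVTYPE[order(-summ$health_impact)])
by_fatal    <- as.character(summ$EVTYPE[order(-summ$FATALITIES)])
```

With these toy numbers by_combined puts TORNADO first (53 combined), while by_fatal puts HEAT first (8 fatalities), so the choice of metric matters when reading Figure 1.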

3.2 Which events have the greatest economic consequences?

We consider total_loss = property + crop damages in nominal USD (not adjusted for inflation).

top_n <- 10
top_econ <- dplyr::slice_head(econ_tbl, n = top_n)

ggplot(top_econ, aes(x = reorder(EVTYPE, total_loss), y = total_loss / 1e9)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste0("Top ", top_n, " Event Types by Economic Losses"),
    x = "Event Type", y = "Total Losses (Billion USD)"
  ) +
  theme_minimal(base_size = 12)

Figure 2. Total economic damages (property + crops) by event type (top 10).
The event type with the highest overall economic losses is FLOOD.

4 Discussion and limitations

5 Reproducibility

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Santiago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.5.2 dplyr_1.1.4  
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5        cli_3.6.5          knitr_1.50         rlang_1.1.6       
##  [5] xfun_0.53          generics_0.1.4     jsonlite_2.0.0     labeling_0.4.3    
##  [9] glue_1.8.0         htmltools_0.5.8.1  sass_0.4.10        scales_1.4.0      
## [13] rmarkdown_2.29     grid_4.5.1         evaluate_1.0.4     jquerylib_0.1.4   
## [17] tibble_3.3.0       fastmap_1.2.0      yaml_2.3.10        lifecycle_1.0.4   
## [21] compiler_4.5.1     codetools_0.2-20   RColorBrewer_1.1-3 pkgconfig_2.0.3   
## [25] rstudioapi_0.17.1  farver_2.1.2       digest_0.6.37      R6_2.6.1          
## [29] tidyselect_1.2.1   pillar_1.11.0      magrittr_2.0.3     bslib_0.9.0       
## [33] withr_3.0.2        gtable_0.3.6       tools_4.5.1        cachem_1.1.0