This report analyzes the NOAA severe weather events database to identify which event types cause the greatest harm to population health and the largest economic losses in the United States. The analysis works directly from the original compressed .csv.bz2 file to maintain reproducibility. First, the key variables are cleaned and transformed: event type, fatalities, injuries, property damage, and crop damage. Next, the damage exponent codes are converted to numeric multipliers so that all amounts are expressed in dollars. Two aggregate metrics are then built by event type: health impact (fatalities + injuries) and economic impact (property + crop losses). Finally, event types are ranked to obtain the top ten in each dimension, and the results are presented in tables and plots. The findings show that the events most damaging to health are not necessarily those with the highest economic cost. This approach helps prioritize preparedness actions according to the dominant risk type.
Severe storms and other weather phenomena can cause both public health and economic problems. This analysis uses the NOAA Storm Database to answer two questions: which events are most harmful to population health, and which events have the greatest economic consequences.
The analysis begins from the raw .csv.bz2 file. If the file is not already present locally, it is downloaded first; it is then read directly from the compressed source through a bzfile connection.
knitr::opts_chunk$set(echo = TRUE)
url_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
local_bz2 <- file.path("data", "StormData.csv.bz2")
if (!dir.exists("data")) {
  dir.create("data", recursive = TRUE)
}
if (!file.exists(local_bz2)) {
  download.file(url = url_data, destfile = local_bz2, mode = "wb")
}
storm_data <- read.csv(bzfile(local_bz2), stringsAsFactors = FALSE)
dim(storm_data)
## [1] 902297 37
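As a small illustration of reading through a bzip2 connection, the following sketch writes a toy table to a temporary .csv.bz2 file and reads it back with the same read.csv(bzfile(...)) pattern. The toy data frame and the temporary path are assumptions made for the example and are not part of the NOAA data.

```r
# Toy round trip through a bzip2-compressed CSV (illustrative data only).
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(3, 1),
  stringsAsFactors = FALSE
)
tmp_bz2 <- tempfile(fileext = ".csv.bz2")  # hypothetical temporary path

# Write the CSV text through a compressing connection, then close it.
con <- bzfile(tmp_bz2, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the unopened connection, decompresses, and closes it again.
toy_back <- read.csv(bzfile(tmp_bz2), stringsAsFactors = FALSE)
print(toy_back)

unlink(tmp_bz2)  # remove the temporary file
```

The same call works on the full archive because decompression happens inside the connection, so no uncompressed copy has to be stored on disk.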
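Data preparation consisted of selecting the required variables and applying the transformations needed to answer the project questions:
1. EVTYPE is normalized (trimmed and upper-cased) to reduce casing and whitespace variation; spelling variants and synonyms such as TSTM WIND and THUNDERSTORM WIND are not merged.
2. Numeric multipliers are derived from the PROPDMGEXP and CROPDMGEXP codes (H, K, M, B, or a single digit read as a power of ten; blank or unrecognized codes default to 1).
3. Health impact is computed as FATALITIES + INJURIES.
4. Economic impact is computed as PROPDMG + CROPDMG, each first multiplied by its multiplier to give an amount in dollars.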
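Steps 2 and 4 can be illustrated on a few made-up records. This is a minimal sketch, not the report's own function: the toy data frame, the letter_mult lookup vector, and the to_mult helper are hypothetical names, and digit codes are left out for brevity.

```r
# Hypothetical lookup for the letter codes only (digit codes are omitted here).
letter_mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up records, not taken from the NOAA data.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 1.5, 0),
  CROPDMGEXP = c("", "M", ""),
  stringsAsFactors = FALSE
)

to_mult <- function(code) {
  m <- letter_mult[toupper(code)]  # NA where the code is not a known letter
  m[is.na(m)] <- 1                 # assumption: blank or unknown means "as recorded"
  unname(m)
}

toy$prop_damage     <- toy$PROPDMG * to_mult(toy$PROPDMGEXP)
toy$crop_damage     <- toy$CROPDMG * to_mult(toy$CROPDMGEXP)
toy$economic_impact <- toy$prop_damage + toy$crop_damage
print(toy)
```

The first record becomes 25 x 1,000 = 25,000 dollars, the second 2,500,000 + 1,500,000 = 4,000,000 dollars, and the third stays at 10 dollars because its exponent code is blank.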
cols_needed <- c(
  "EVTYPE", "FATALITIES", "INJURIES",
  "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"
)
storm_subset <- storm_data[, cols_needed]
storm_subset$EVTYPE <- toupper(trimws(storm_subset$EVTYPE))
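# Convert a damage exponent code into a numeric multiplier.
exp_to_multiplier <- function(x) {
  x <- toupper(trimws(as.character(x)))
  out <- rep(1, length(x))  # default: amount is taken as recorded
  # Letter codes: hundreds, thousands, millions, billions
  out[x == "H"] <- 1e2
  out[x == "K"] <- 1e3
  out[x == "M"] <- 1e6
  out[x == "B"] <- 1e9
  # Single digits are read as powers of ten
  digit_idx <- grepl("^[0-9]$", x)
  out[digit_idx] <- 10 ^ as.numeric(x[digit_idx])
  # Blank, symbolic, or missing codes keep a multiplier of 1
  unknown_idx <- x %in% c("", "+", "-", "?", "NA") | is.na(x)
  out[unknown_idx] <- 1
  out
}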
storm_subset$prop_damage <- storm_subset$PROPDMG * exp_to_multiplier(storm_subset$PROPDMGEXP)
storm_subset$crop_damage <- storm_subset$CROPDMG * exp_to_multiplier(storm_subset$CROPDMGEXP)
storm_subset$health_impact <- storm_subset$FATALITIES + storm_subset$INJURIES
storm_subset$economic_impact <- storm_subset$prop_damage + storm_subset$crop_damage
health_by_event <- aggregate(health_impact ~ EVTYPE, data = storm_subset, sum, na.rm = TRUE)
econ_by_event <- aggregate(economic_impact ~ EVTYPE, data = storm_subset, sum, na.rm = TRUE)
health_by_event <- health_by_event[health_by_event$health_impact > 0, ]
econ_by_event <- econ_by_event[econ_by_event$economic_impact > 0, ]
top_health <- head(health_by_event[order(-health_by_event$health_impact), ], 10)
top_econ <- head(econ_by_event[order(-econ_by_event$economic_impact), ], 10)
top_health
##                EVTYPE health_impact
## 750           TORNADO         96979
## 108    EXCESSIVE HEAT          8428
## 771         TSTM WIND          7461
## 146             FLOOD          7259
## 410         LIGHTNING          6046
## 235              HEAT          3037
## 130       FLASH FLOOD          2755
## 379         ICE STORM          2064
## 677 THUNDERSTORM WIND          1621
## 880      WINTER STORM          1527
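The table lists TSTM WIND and THUNDERSTORM WIND as separate rows because the cleaning step only adjusts case and whitespace. The following is a minimal sketch, assuming the two labels denote the same event type, of how a recode would change the ranking. It uses only the ten totals printed above, so other variants in the full data are not counted; top_health_toy and recode_map are hypothetical names.

```r
# Totals copied from the top-ten table above.
top_health_toy <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
             "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND",
             "WINTER STORM"),
  health_impact = c(96979, 8428, 7461, 7259, 6046, 3037, 2755, 2064, 1621, 1527),
  stringsAsFactors = FALSE
)

# Hypothetical recode map: abbreviation -> full label (an assumption).
recode_map <- c("TSTM WIND" = "THUNDERSTORM WIND")

# Replace only the labels that appear in the map, then re-aggregate and re-rank.
hit <- top_health_toy$EVTYPE %in% names(recode_map)
top_health_toy$EVTYPE[hit] <- recode_map[top_health_toy$EVTYPE[hit]]

merged <- aggregate(health_impact ~ EVTYPE, data = top_health_toy, sum)
merged <- merged[order(-merged$health_impact), ]
print(head(merged, 3))
```

Under this assumption the combined THUNDERSTORM WIND total is 7,461 + 1,621 = 9,082, which moves it into second place ahead of EXCESSIVE HEAT (8,428).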
# Darker bars mark larger totals: the ramp runs light to dark and is
# assigned in ascending order of health impact.
n_health <- nrow(top_health)
health_colors <- character(n_health)
health_colors[order(top_health$health_impact)] <-
  grDevices::colorRampPalette(c("#F2E5FF", "#031b5f"))(n_health)
barplot(
  height = top_health$health_impact,
  names.arg = top_health$EVTYPE,
  las = 2,
  cex.names = 0.8,
  col = health_colors,
  main = "Top 10 event types by health impact",
  ylab = "Fatalities + Injuries"
)
top_econ
##                EVTYPE economic_impact
## 146             FLOOD    150319678257
## 364 HURRICANE/TYPHOON     71913712800
## 750           TORNADO     57362333947
## 591       STORM SURGE     43323541000
## 204              HAIL     18761221986
## 130       FLASH FLOOD     18244041079
## 76            DROUGHT     15018672000
## 355         HURRICANE     14610229010
## 521       RIVER FLOOD     10148404500
## 379         ICE STORM      8967041360
econ_colors_distinct <- grDevices::hcl.colors(nrow(top_econ), palette = "Set 2")
barplot(
  height = top_econ$economic_impact / 1e9,
  names.arg = top_econ$EVTYPE,
  las = 2,
  cex.names = 0.8,
  col = econ_colors_distinct,
  main = "Top 10 event types by economic impact",
  ylab = "Total economic damage (billion USD)"
)
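The findings show that the event types with the greatest human impact are not always those with the highest economic cost. Tornadoes account for by far the most fatalities and injuries (96,979), while floods produce the largest economic losses (about 150 billion USD); only four event types (tornado, flood, flash flood, and ice storm) appear in both top-ten lists. Preparedness and resource allocation should therefore consider at least two risk dimensions: population health and economic losses.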
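The overlap between the two rankings can be checked directly. The sketch below copies the event names from the two top-ten tables into hypothetical vectors (health_names, econ_names) and compares them with base R set operations; it is an illustration of the comparison, not part of the original pipeline.

```r
# Event names copied from the two top-ten tables, in ranked order.
health_names <- c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
                  "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND",
                  "WINTER STORM")
econ_names <- c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL",
                "FLASH FLOOD", "DROUGHT", "HURRICANE", "RIVER FLOOD", "ICE STORM")

in_both     <- intersect(health_names, econ_names)  # present in both rankings
health_only <- setdiff(health_names, econ_names)    # only in the health ranking
econ_only   <- setdiff(econ_names, health_names)    # only in the economic ranking

print(in_both)
print(health_only)
print(econ_only)
```

With the names as printed, four event types appear in both lists and six are unique to each, which is the sense in which the two risk dimensions only partly coincide.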