This analysis explores the NOAA Storm Database to determine which weather events in the United States are most harmful to public health and which have the greatest economic consequences. It covers the period from 1950 to November 2011. Tornadoes are the most harmful events with respect to population health, leading in both fatalities and injuries. In terms of economic consequences, floods cause the greatest property damage, while droughts cause the most crop damage. When property and crop damage are combined, floods are the leading cause of economic loss.
The data comes as a comma-separated-value (CSV) file compressed
with the bzip2 algorithm. We start by loading the necessary libraries and
reading the raw file; read.csv decompresses it on the fly. The chunk option
cache=TRUE speeds up processing on subsequent runs.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr) # For reshaping data for plotting
# Read the data directly from the bz2 file
zipFile <- "repdata_data_StormData.csv.bz2"
storm_data <- read.csv(zipFile)
# Preview the data structure
dim(storm_data)
## [1] 902297 37
head(storm_data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP")])
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 TORNADO 0 15 25.0 K
## 2 TORNADO 0 0 2.5 K
## 3 TORNADO 0 2 25.0 K
## 4 TORNADO 0 2 2.5 K
## 5 TORNADO 0 2 2.5 K
## 6 TORNADO 0 6 2.5 K
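Because parsing the compressed CSV is the slowest part of the loading step above, the parsed data frame could also be cached on disk independently of knitr. The following is a minimal sketch, not part of the original analysis: the helper name read_storm_cached and the .rds file name are hypothetical, and it assumes the directory holding the CSV is writable.

```r
# Hypothetical helper: parse the CSV once, then reuse a serialized copy.
read_storm_cached <- function(csv_path, rds_path = paste0(csv_path, ".rds")) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: reuse the cached data frame
  }
  data <- read.csv(csv_path)    # read.csv handles .bz2 files transparently
  saveRDS(data, rds_path)       # store the parsed result for later runs
  data
}

# Usage (assumes the file from the analysis is in the working directory):
# storm_data <- read_storm_cached("repdata_data_StormData.csv.bz2")
```

The cache is never invalidated automatically in this sketch, so the .rds file would have to be deleted by hand if the source CSV changed.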
The dataset contains 37 variables, but we only need the event type (EVTYPE), the health variables (FATALITIES, INJURIES), and the economic variables (PROPDMG, CROPDMG) together with their magnitude codes (PROPDMGEXP, CROPDMGEXP).
# Select only relevant columns
df <- storm_data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Clean EVTYPE: Convert to uppercase to handle inconsistencies (e.g., "Flood" vs "FLOOD")
df$EVTYPE <- toupper(df$EVTYPE)
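Uppercasing does not merge every variant label: the result tables in this report still list TSTM WIND and THUNDERSTORM WIND as separate event types. A further normalisation step could look like the sketch below. It is an assumption, not part of the original analysis: the helper name normalise_evtype is hypothetical, and it merges only one abbreviation as an illustration. Applying it would combine rows and could change the rankings reported here.

```r
# Hypothetical helper, shown as an optional extra cleaning step.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))       # uppercase, drop leading/trailing whitespace
  x <- gsub("\\s+", " ", x)     # collapse repeated internal whitespace
  # Merge one known abbreviation into its full label.
  x <- sub("^TSTM WIND$", "THUNDERSTORM WIND", x)
  x
}

normalise_evtype(c(" Tstm Wind", "FLOOD ", "thunderstorm  wind"))
```

Whether two labels really denote the same event type is a judgment call, which is why the sketch keeps the mapping explicit rather than using fuzzy matching.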
The variables PROPDMGEXP and CROPDMGEXP
represent the magnitude of the damage (e.g., “K” for thousands, “M” for
millions, “B” for billions). We need to convert these characters into
numeric multipliers.
# Function to map exponents to numeric values
get_multiplier <- function(exp) {
  exp <- toupper(exp)
  if (exp == "H") return(10^2)
  if (exp == "K") return(10^3)
  if (exp == "M") return(10^6)
  if (exp == "B") return(10^9)
  if (exp %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8")) return(10^1) # Treat digits 0-8 as a multiplier of 10
  if (exp == "+") return(1)
  return(0) # Empty or invalid codes get multiplier 0, so that damage is not counted
}
# Apply the function to create new numeric columns for actual damage cost
# Note: This step might take a moment due to dataset size
df$PROP_MULTIPLIER <- sapply(df$PROPDMGEXP, get_multiplier)
df$CROP_MULTIPLIER <- sapply(df$CROPDMGEXP, get_multiplier)
df$PROP_COST <- df$PROPDMG * df$PROP_MULTIPLIER
df$CROP_COST <- df$CROPDMG * df$CROP_MULTIPLIER
df$TOTAL_COST <- df$PROP_COST + df$CROP_COST
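Calling get_multiplier once per row through sapply is what makes this step slow on roughly 900,000 rows. A vectorized lookup can reproduce the same mapping in a single indexing operation. The sketch below is an alternative under that assumption, not the code used for the results in this report; the names multiplier_lookup and get_multiplier_vec are hypothetical.

```r
# Hypothetical lookup table reproducing the mapping in get_multiplier().
multiplier_lookup <- c(
  H = 1e2, K = 1e3, M = 1e6, B = 1e9, "+" = 1,
  setNames(rep(10, 9), as.character(0:8))   # digits 0-8 are treated as 10
)

get_multiplier_vec <- function(exp) {
  # Index the named vector by code; unknown or empty codes come back as NA.
  out <- unname(multiplier_lookup[toupper(as.character(exp))])
  out[is.na(out)] <- 0   # empty or unrecognised codes contribute no damage
  out
}

# Worked example: 25.0 with code "K" is 25,000 dollars, while a value
# with an empty code is multiplied by 0.
c(25.0, 2.5, 1.2) * get_multiplier_vec(c("K", "", "b"))
```

With this helper, the two sapply calls could be replaced by get_multiplier_vec(df$PROPDMGEXP) and get_multiplier_vec(df$CROPDMGEXP).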
Aggregate the total number of fatalities and injuries by event type and select the top 10 most harmful events.
health_impact <- df %>%
  group_by(EVTYPE) %>%
  summarise(Fatalities = sum(FATALITIES),
            Injuries = sum(INJURIES)) %>%
  arrange(desc(Fatalities + Injuries)) %>%
  slice(1:10)
# Reshape for plotting (Wide to Long format)
health_long <- health_impact %>%
  pivot_longer(cols = c(Fatalities, Injuries), names_to = "Type", values_to = "Count")
print(health_impact)
## # A tibble: 10 × 3
## EVTYPE Fatalities Injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 TSTM WIND 504 6957
## 4 FLOOD 470 6789
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
## 7 FLASH FLOOD 978 1777
## 8 ICE STORM 89 1975
## 9 THUNDERSTORM WIND 133 1488
## 10 WINTER STORM 206 1321
ggplot(health_long, aes(x = reorder(EVTYPE, Count), y = Count, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge") + # Side by side bars
  coord_flip() + # Horizontal bars for better readability
  labs(title = "Top 10 Weather Events Most Harmful to Population Health",
       x = "Event Type",
       y = "Number of People Affected",
       fill = "Health Impact") +
  theme_minimal()
Figure 1: The top 10 weather events by combined fatalities and injuries. Tornadoes are by far the most dangerous event type for population health.
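The ranking above sorts on fatalities plus injuries, so the much larger injury counts dominate the order: FLASH FLOOD, for example, has more fatalities than TSTM WIND but ranks lower. Ranking each outcome separately is one way to check a claim about both measures. The sketch below uses a small toy data frame whose values are invented for illustration and are not taken from the NOAA database; the object names are hypothetical.

```r
library(dplyr)

# Toy data for illustration only; these numbers are not NOAA figures.
toy <- data.frame(
  EVTYPE     = c("A", "A", "B", "C", "C"),
  FATALITIES = c(1, 2, 10, 0, 1),
  INJURIES   = c(50, 30, 5, 0, 2)
)

totals <- toy %>%
  group_by(EVTYPE) %>%
  summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES))

# Rank each outcome on its own rather than on the combined total.
top_fatalities <- totals %>% slice_max(Fatalities, n = 2)
top_injuries   <- totals %>% slice_max(Injuries, n = 2)
```

In the toy data the two rankings disagree on the leading event, which is the kind of difference a combined ranking would hide.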
Aggregate the total economic damage (Property + Crop) by event type and select the top 10 events.
econ_impact <- df %>%
  group_by(EVTYPE) %>%
  summarise(Property = sum(PROP_COST),
            Crop = sum(CROP_COST),
            Total = sum(TOTAL_COST)) %>%
  arrange(desc(Total)) %>%
  slice(1:10)
# Reshape for plotting
econ_long <- econ_impact %>%
  select(EVTYPE, Property, Crop) %>%
  pivot_longer(cols = c(Property, Crop), names_to = "Damage_Type", values_to = "Cost")
print(econ_impact)
## # A tibble: 10 × 4
## EVTYPE Property Crop Total
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709800 5661968450 150319678250
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937162897 414954710 57352117607
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732269877 3025954650 18758224527
## 6 FLASH FLOOD 16140815011 1421317100 17562132111
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944928310 5022113500 8967041810
ggplot(econ_long, aes(x = reorder(EVTYPE, Cost), y = Cost / 10^9, fill = Damage_Type)) +
  geom_bar(stat = "identity") + # Stacked bars
  coord_flip() +
  scale_fill_manual(values = c("Property" = "steelblue", "Crop" = "darkgreen")) +
  labs(title = "Top 10 Weather Events with Greatest Economic Consequences",
       x = "Event Type",
       y = "Total Economic Damage (Billions of USD)",
       fill = "Damage Type") +
  theme_minimal()
Figure 2: The top 10 weather events by total economic damage. Floods are the primary cause of economic loss, driven largely by property damage, while droughts account for the largest crop losses.