Synopsis

This report provides a comprehensive analysis of the NOAA Storm Database (1950–2011) to assess the impact of weather events across the United States, focusing on population health and economic consequences. The study addresses two primary questions: (1) which weather events are most harmful to population health, measured by total fatalities and injuries, and (2) which events cause the greatest economic damage, quantified by combined property and crop losses in USD. Additionally, it evaluates the average harm per event occurrence to identify events with the highest magnitude of impact per instance.

Key findings reveal that “Tornado,” “Thunderstorm,” and “Flood” rank as the top three events for total health impacts, with “Excessive Heat” in fourth place, reflecting significant combined fatalities and injuries. Economically, “Flood,” “Hurricane/Typhoon,” “Tornado,” and “Storm Surge” dominate the top four for total damage, with losses reaching multiple billions of USD. When considering average harm per occurrence, “Hurricane/Typhoon,” “Tsunami,” “Glaze,” and “Excessive Heat” emerge as the most impactful, indicating their severe per-event consequences despite varying frequencies. These insights, derived from data spanning over six decades (with noted limitations in early years), underscore the need for targeted preparedness strategies. The report includes three visualizations: total health impacts (Figure 1), total economic damage (Figure 2), and average harm per occurrence (Figure 3).


Data Processing

This section outlines the process of acquiring and preparing the NOAA Storm Database for analysis. The raw data, provided as a compressed .csv.bz2 file, is read directly from the source website using the fread function from the data.table package, which efficiently handles large compressed files. The dataset contains records of major storms and weather events in the United States from 1950 to November 2011, including details on event types, locations, fatalities, injuries, and property damage.

Package Installation & Loading

First, all R packages needed to run this analysis and render the report are loaded. Any package that is not yet installed is installed automatically and then loaded.

required_packages <- c("data.table", "lubridate", "ggplot2", "cowplot", "reshape2")

# Check if packages are installed, install if not, then load them
for (pkg in required_packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
    library(pkg, character.only = TRUE)
  }
}

Data Acquisition

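The NOAA Storm Database is read directly from the URL, so no permanent local copy is kept: fread downloads the file to a temporary location and reads the compressed .csv.bz2 file into a data.table object for efficient processing. Depending on the installed data.table version, reading .bz2 files may also require the R.utils package. Because the download is time-consuming, caching is enabled for this step.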

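# Caching is a knitr chunk option: set cache = TRUE in this chunk's header
# (knitr::opts_chunk$set(cache = TRUE) would instead cache every later chunk).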

data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_data <- fread(data_url, stringsAsFactors = FALSE)
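
As an alternative to chunk caching, the file can be downloaded once and read from disk on later runs. The following is a minimal sketch, not part of the original analysis: the helper name fetch_once and the local file name are assumptions chosen for illustration.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, dest, mode = "wb")
  }
  dest
}

# Usage (commented out so that sourcing this sketch does not trigger a large download):
# local_file <- fetch_once(data_url, "StormData.csv.bz2")  # file name is an assumption
# storm_data <- data.table::fread(local_file, stringsAsFactors = FALSE)
```

With this approach the first run pays the download cost and later runs only pay the parsing cost.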

Data Cleaning & Preparation

The raw NOAA Storm Database requires cleaning to ensure data quality, consistency, and usability for addressing the project’s key questions: identifying weather events most harmful to population health (via fatalities and injuries) and those with the greatest economic consequences (via property and crop damage). Without proper cleaning, inconsistencies such as varying case in event types, missing values, or improperly formatted damage estimates could lead to inaccurate aggregations, biased results, or errors in the analysis.

For instance, the database spans decades with evolving recording practices, leading to potential data entry variations and incompleteness in earlier years. The cleaning steps are designed to mitigate these issues while preserving as much valid data as possible, ensuring reproducibility, and aligning with the National Weather Service’s documentation on variable definitions. The following subsections detail each step with its rationale and corresponding code.

1. Inspecting the Dataset

Summarizing key columns (EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG) helps understand the data’s structure, ranges, and potential issues like outliers or data types. We also explicitly check for missing values in these critical columns to quantify data gaps and inform handling strategies, ensuring we make informed decisions about data retention or removal.

# Inspect the dataset
summary(storm_data[, .(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)])
##     EVTYPE            FATALITIES           INJURIES            PROPDMG       
##  Length:902297      Min.   :  0.00000   Min.   :   0.0000   Min.   :   0.00  
##  Class :character   1st Qu.:  0.00000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Mode  :character   Median :  0.00000   Median :   0.0000   Median :   0.00  
##                     Mean   :  0.01678   Mean   :   0.1557   Mean   :  12.06  
##                     3rd Qu.:  0.00000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##                     Max.   :583.00000   Max.   :1700.0000   Max.   :5000.00  
##     CROPDMG       
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  1.527  
##  3rd Qu.:  0.000  
##  Max.   :990.000
# Check for missing values in critical columns
missing_summary <- storm_data[, .(
  EVTYPE_missing = sum(is.na(EVTYPE)),
  FATALITIES_missing = sum(is.na(FATALITIES)),
  INJURIES_missing = sum(is.na(INJURIES)),
  PROPDMG_missing = sum(is.na(PROPDMG)),
  CROPDMG_missing = sum(is.na(CROPDMG))
)]
print(missing_summary)
##    EVTYPE_missing FATALITIES_missing INJURIES_missing PROPDMG_missing
##             <int>              <int>            <int>           <int>
## 1:              0                  0                0               0
##    CROPDMG_missing
##              <int>
## 1:               0

The output of missing_summary indicates no missing values (NA) in the critical columns (EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG). This suggests the dataset is complete for these variables, which simplifies our cleaning process. However, we proceed with checks for other potential issues, such as empty strings or invalid entries, in subsequent steps.

2. Handling Missing or Invalid Values

Although the inspection shows no NA values in the critical columns, we include a step to handle potential edge cases, such as empty strings in EVTYPE or zero values in damage columns that might reflect unreported data. For EVTYPE, we ensure that no empty or whitespace-only strings (e.g., "") exist, as this column is central to grouping events for analysis. Rows with an empty EVTYPE would be meaningless for aggregating health or economic impacts and are removed. For PROPDMG and CROPDMG, values of zero are retained, as the documentation suggests unreported damage implies no significant loss. Negative values would be invalid for damage estimates, so any that occur are reset to zero.

# Remove rows with missing, empty, or whitespace-only EVTYPE
storm_data <- storm_data[!is.na(EVTYPE) & trimws(EVTYPE) != ""]

# Reset any negative damage values to zero (negative damage is invalid)
storm_data[PROPDMG < 0, PROPDMG := 0]
storm_data[CROPDMG < 0, CROPDMG := 0]
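
To illustrate why whitespace must be trimmed before comparing against the empty string, here is a small base R sketch on a made-up vector (the values are illustrative, not taken from the database):

```r
# Toy event labels: one empty, one whitespace-only, one missing
evtype <- c("TORNADO", "", "   ", "HAIL", NA)

# Comparing against "" alone does not flag the whitespace-only entry
naive <- evtype != ""

# Trimming first, and handling NA explicitly, flags all three problem entries
keep <- !is.na(evtype) & trimws(evtype) != ""
cleaned <- evtype[keep]
print(cleaned)

# Negative damage values can be clamped to zero with pmax()
clamped <- pmax(c(5, -1, 0), 0)
print(clamped)
```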

3. Standardizing Event Types & Removing Duplicate Entries

The EVTYPE column may contain inconsistencies due to manual entry over many years, such as variations in casing (e.g., “Tornado” vs. “TORNADO”), extra spaces, or minor misspellings. Converting to uppercase and trimming whitespace reduces these duplicates, allowing accurate aggregation by event type.

# Standardize EVTYPE
storm_data[, EVTYPE := toupper(trimws(EVTYPE))]

The EVTYPE column also contains near-duplicate entries due to abbreviations or slight variations in naming (e.g., “TSTM WIND” and “THUNDERSTORM WIND” both refer to winds produced by thunderstorms). Combining these duplicates is essential to avoid underestimating the impact of similar events in aggregations for population health or economic consequences. Guided by the official event types in the National Weather Service Storm Data Documentation, we map common variations to one standardized label per group. The patterns passed to grepl are unanchored, so they match anywhere in a label. As a result, the FLOOD rule also matches “FLASH FLOOD”, and flash floods are ultimately counted under “FLOOD” even though the two are distinct official event types (general flooding vs. sudden flooding). Similarly, blizzards and ice storms are grouped under “WINTER STORM”.

# Map common duplicates to standardized event labels
storm_data[grepl("TSTM WIND|THUNDERSTORM WINDS|THUNDERSTORMWIND|THUNDERTORM WIND|THUDERSTORM WINDS|THUNDERSTROM WIND|THUNDERSTORM WIND.*|TSTMW|THUNDERSTORMW|THUNDERSTORM WINDS.*|SEVERE THUNDERSTORM.*", EVTYPE), EVTYPE := "THUNDERSTORM"]
storm_data[grepl("HIGH WIND.*|STRONG WIND.*|GUSTY WIND", EVTYPE), EVTYPE := "HIGH WIND"]
storm_data[grepl("EXCESSIVE HEAT|HEAT WAVE|HEATWAVE|EXTREME HEAT", EVTYPE), EVTYPE := "EXCESSIVE HEAT"]
storm_data[grepl("WINTER STORM|WINTER WEATHER|MIXED PRECIP|WINTRY MIX|BLIZZARD|ICE STORM", EVTYPE), EVTYPE := "WINTER STORM"]
storm_data[grepl("FLASH FLOOD.*|FLASH FLOODING", EVTYPE), EVTYPE := "FLASH FLOOD"]
# grepl() is unanchored, so this rule also matches "FLASH FLOOD" and merges it into "FLOOD"
storm_data[grepl("FLOOD.*|URBAN FLOOD|COASTAL FLOOD", EVTYPE), EVTYPE := "FLOOD"]
storm_data[grepl("HURRICANE/TYPHOON|HURRICANE.*|TYPHOON.*", EVTYPE), EVTYPE := "HURRICANE/TYPHOON"]
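
The effect of unanchored patterns can be checked on a few labels. The sketch below uses made-up example labels and shows how a leading ^ anchor would restrict a pattern to labels that start with the given text. It is an illustration of how grepl behaves, not a change to the mapping used in this report.

```r
# Example labels (illustrative only)
labels <- c("FLOOD", "FLASH FLOOD", "COASTAL FLOOD", "FLOODING")

# Unanchored: matches "FLOOD" anywhere in the label
unanchored <- grepl("FLOOD", labels)

# Anchored: matches only labels that begin with "FLOOD"
anchored <- grepl("^FLOOD", labels)

print(data.frame(labels, unanchored, anchored))
```

Under the anchored pattern, “FLASH FLOOD” and “COASTAL FLOOD” are no longer matched, so an analysis that wanted to keep flash floods separate would need anchored patterns or a different rule order.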

4. Processing Damage Variables

Property and crop damage are recorded in two parts: a numeric base (PROPDMG, CROPDMG) and an exponent code (PROPDMGEXP, CROPDMGEXP) such as “K” (thousands), “M” (millions), or “B” (billions), as per the Storm Data Documentation. Without conversion, these cannot be summed meaningfully for total economic consequences.

We define a function to map exponents to numeric multipliers (defaulting to 1 for empty or invalid codes to avoid data loss) and compute TOTAL_DMG in consistent USD units. This step ensures comparable economic impacts across events.

# Define function to convert damage exponents to numeric multipliers
convert_exp <- function(exp) {
  exp <- toupper(exp)
  ifelse(exp == "K", 1e3,
         ifelse(exp == "M", 1e6,
                ifelse(exp == "B", 1e9, 1)))
}

# Convert damage exponents and calculate total damage in USD
storm_data[, PROPDMGEXP := ifelse(is.na(PROPDMGEXP) | PROPDMGEXP == "", 1, convert_exp(PROPDMGEXP))]
storm_data[, CROPDMGEXP := ifelse(is.na(CROPDMGEXP) | CROPDMGEXP == "", 1, convert_exp(CROPDMGEXP))]
storm_data[, TOTAL_DMG := PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP]
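
As a worked example of the conversion, the base R sketch below applies the same rule (K, M, B, otherwise 1) to a few made-up records using a named lookup vector. The name exp_to_mult and the sample values are assumptions for illustration.

```r
# Lookup table for the documented exponent codes
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map codes to multipliers, defaulting to 1 for anything else
exp_to_mult <- function(code) {
  m <- unname(multiplier[toupper(code)])
  m[is.na(m)] <- 1  # empty, missing, or unrecognized codes
  m
}

# Made-up records: base amount and exponent code
dmg  <- c(25, 2.5, 5, 10, 3)
code <- c("K", "k", "B", "", "5")

usd <- dmg * exp_to_mult(code)
print(usd)
```

Here 25 with code “K” becomes 25,000 USD and 5 with code “B” becomes 5 billion USD, while the records with an empty or unrecognized code keep their base amount.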

5. Parsing Dates

The BGN_DATE column is in a string format (e.g., “MM/DD/YYYY HH:MM:SS”), which is not suitable for chronological analysis or filtering. Using lubridate::mdy_hms, we convert it to a POSIXct date-time object with UTC timezone to enable time-based insights, such as trends over years or seasonal patterns.

# Parse BGN_DATE to date format
storm_data[, BGN_DATE := mdy_hms(BGN_DATE, tz = "UTC")]
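
Once dates are parsed, time-based summaries become straightforward. The sketch below uses base R on a few made-up date strings in the same month/day/year layout to count events per year. It uses as.POSIXct rather than lubridate only to keep the example dependency-free; on the parsed column, lubridate::year() would give the same years.

```r
# Made-up date strings in month/day/year hour:minute:second layout
raw <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00", "11/15/1951 0:00:00")

# Parse to POSIXct in UTC
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Extract the calendar year and count events per year
years <- as.integer(format(parsed, "%Y"))
events_per_year <- table(years)
print(events_per_year)
```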

Verifying the Cleaned Data

After cleaning, we verify the transformed dataset by displaying the first rows of the key variables. This spot check confirms that event types are standardized, that the health and damage metrics are populated, and that dates are properly parsed, setting the stage for accurate analysis of population health and economic impacts.

# Verify cleaned data
head(storm_data[, .(EVTYPE, FATALITIES, INJURIES, TOTAL_DMG, BGN_DATE)])
##     EVTYPE FATALITIES INJURIES TOTAL_DMG   BGN_DATE
##     <char>      <num>    <num>     <num>     <POSc>
## 1: TORNADO          0       15     25000 1950-04-18
## 2: TORNADO          0        0      2500 1950-04-18
## 3: TORNADO          0        2     25000 1951-02-20
## 4: TORNADO          0        2      2500 1951-06-08
## 5: TORNADO          0        2      2500 1951-11-15
## 6: TORNADO          0        6      2500 1951-11-15

Analysis

This section presents the analysis of the NOAA Storm Database to address two questions:

Which types of weather events are most harmful to population health?

Which types have the greatest economic consequences?

Population Health Impacts

To determine the weather events most harmful to population health, we aggregate FATALITIES and INJURIES by EVTYPE to capture both fatal and non-fatal impacts. We select the top 10 event types based on combined fatalities and injuries and visualize both components within each bar to provide a comprehensive view of health risks. The bars are ordered by this combined total, with fatalities shown in orange and injuries in blue.

# Aggregate fatalities and injuries by event type
health_impact <- storm_data[, .(
  Total_Fatalities = sum(FATALITIES),
  Total_Injuries = sum(INJURIES)
), by = EVTYPE]

# Calculate combined health impact
health_impact[, Total_Health_Impact := Total_Fatalities + Total_Injuries]

# Get top 10 event types by combined health impact
top_health <- health_impact[order(-Total_Health_Impact)][1:10]

# Reshape data for combined bar plot
health_long <- melt(top_health[, .(EVTYPE, Total_Fatalities, Total_Injuries)], 
                    id.vars = "EVTYPE", measure.vars = c("Total_Fatalities", "Total_Injuries"),
                    variable.name = "Impact_Type", value.name = "Count")
# Create combined bar plot ordered by combined health impact from largest to smallest
p1 <- ggplot(health_long, aes(x = factor(EVTYPE, levels = rev(top_health$EVTYPE)), y = Count, fill = Impact_Type)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("Total_Fatalities" = "#D55E00", "Total_Injuries" = "#0072B2"),
                    labels = c("Fatalities", "Injuries")) +
  coord_flip() +
  labs(x = "Event Type", y = "Total Count", title = "Top 10 Weather Events - Health Impacts",
       fill = "Impact Type") +
  scale_y_continuous(labels = scales::comma, breaks = scales::pretty_breaks(n = 5)) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    axis.text = element_text(color = "black"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "grey80"),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(color = "grey", linewidth = 1),
    plot.margin = unit(c(0.5, 1.5, 0.5, 0.5), "cm")
  )

# Display the plot
print(p1)

This bar plot displays the health impacts (Injuries and Fatalities) of the top 10 weather event types.

Figure 1: Top Weather Events Impacting Population Health

This bar plot displays the top 10 weather event types by combined fatalities and injuries in the NOAA Storm Database (1950–2011), with each bar stacking injuries (blue) and fatalities (orange). The events are ordered by this combined total, highlighting tornadoes and thunderstorms as leading causes of harm, while also revealing substantial counts for floods and excessive heat.

Economic Consequences

To identify weather events with the greatest economic consequences, we aggregate TOTAL_DMG (combined property and crop damage in USD) by EVTYPE and select the top 10 event types. Although damage values span orders of magnitude (e.g., millions to billions), we plot totals in billions of USD on a linear scale, which keeps the relative sizes of the bars directly comparable.

# Aggregate total damage by event type
economic_impact <- storm_data[, .(Total_Damage = sum(TOTAL_DMG)), by = EVTYPE]

# Get top 10 event types by total damage
top_damage <- economic_impact[order(-Total_Damage)][1:10]
# Create plot for total economic damage
p2 <- ggplot(top_damage, aes(x = reorder(EVTYPE, Total_Damage), y = Total_Damage / 1e9)) +
  geom_bar(stat = "identity", fill = "#009E73") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma, breaks = scales::pretty_breaks(n = 10)) +
  labs(x = "Event Type", y = "Total Damage (Billions USD)", title = "Top 10 Weather Events by Economic Damage") +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    axis.text = element_text(color = "black"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "grey80"),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(color = "grey", linewidth = 1),
    plot.margin = unit(c(0.5, 1.5, 0.5, 0.5), "cm")
  )

# Display the plot
print(p2)

This plot displays the top 10 event types ordered by total economic damage (property and crop damage combined).

Figure 2: Top Weather Events Impacting the Economy

This bar plot illustrates the top 10 weather event types by total economic damage (property and crop combined, in billions USD) in the NOAA Storm Database (1950–2011). “FLOOD” (including flash floods), “HURRICANE/TYPHOON”, and “TORNADO” cause the highest economic losses; these totals partly reflect the merging of near-duplicate event types during cleaning.

Magnitude of Harm Per Event Occurrence

To identify weather events with the largest magnitude of average harm per occurrence, we calculate the average health harm (fatalities + injuries per event) and average economic damage (TOTAL_DMG per event) for each EVTYPE. We filter for events with at least 10 occurrences to avoid outliers from rare events. To compare these metrics on a single plot, we normalize each average by its maximum value across all events (0-1 scale) and compute a combined normalized score. The top 10 events by this combined score are selected for visualization.

# Calculate statistics per event type
event_stats <- storm_data[, .(
  Occurrences = .N,
  Total_Health_Harm = sum(FATALITIES + INJURIES),
  Total_Economic_Damage = sum(TOTAL_DMG)
), by = EVTYPE]

event_stats[, Avg_Health_Harm := Total_Health_Harm / Occurrences]
event_stats[, Avg_Economic_Damage := Total_Economic_Damage / Occurrences]

# Filter for events with at least 10 occurrences
event_stats_filtered <- event_stats[Occurrences >= 10]

# Normalize averages (0-1 scale)
event_stats_filtered[, Norm_Avg_Health := Avg_Health_Harm / max(Avg_Health_Harm)]
event_stats_filtered[, Norm_Avg_Econ := Avg_Economic_Damage / max(Avg_Economic_Damage)]

# Calculate combined normalized score
event_stats_filtered[, Combined_Norm := (Norm_Avg_Health + Norm_Avg_Econ) / 2]

# Get top 10 by combined normalized score
top_combined <- event_stats_filtered[order(-Combined_Norm)][1:10]

# Create bar plot of the combined normalized score
p3 <- ggplot(top_combined, aes(x = factor(EVTYPE, levels = rev(top_combined$EVTYPE)), y = Combined_Norm)) +
  geom_bar(stat = "identity", fill = "#F0E442") +
  coord_flip() +
  labs(x = "Event Type", y = "Combined Average Harm per Event (0-1 Scale)", title = "Top 10 Events by Average Harm per Occurrence") +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    axis.text = element_text(color = "black"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "grey80"),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(color = "grey", linewidth = 1),
    plot.margin = unit(c(0.5, 1.5, 0.5, 0.5), "cm")  # Ensure no cutoff
  )

# Display the plot
print(p3)

This plot displays the top 10 event types ordered by average harm per occurrence.

Figure 3: Top 10 Weather Events by Average Harm per Occurrence

This bar plot displays the top 10 weather event types by average harm per occurrence in the NOAA Storm Database (1950–2011). Average health harm and average economic damage are each normalized to a 0-1 scale and then averaged into a combined score, which sets the bar order. Events such as “HURRICANE/TYPHOON” and “TSUNAMI” stand out for their high average impact per occurrence.
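
To make the scoring concrete, the base R sketch below applies the same normalization to three hypothetical event types (A, B, C) with made-up averages. It follows the steps described above: divide each metric by its maximum, then average the two normalized values with equal weight.

```r
# Hypothetical per-occurrence averages for three event types
toy <- data.frame(
  EVTYPE = c("A", "B", "C"),
  Avg_Health_Harm = c(2, 1, 0.5),          # fatalities + injuries per event
  Avg_Economic_Damage = c(1e6, 4e6, 2e6)   # USD per event
)

# Normalize each metric by its maximum (0-1 scale)
toy$Norm_Avg_Health <- toy$Avg_Health_Harm / max(toy$Avg_Health_Harm)
toy$Norm_Avg_Econ <- toy$Avg_Economic_Damage / max(toy$Avg_Economic_Damage)

# Equal-weight combined score
toy$Combined_Norm <- (toy$Norm_Avg_Health + toy$Norm_Avg_Econ) / 2

# Rank from highest to lowest combined score
ranked <- toy[order(-toy$Combined_Norm), c("EVTYPE", "Combined_Norm")]
print(ranked)
```

In this toy case, B ranks first (0.75) ahead of A (0.625) and C (0.375), even though A has the highest health harm. Because each metric is scaled by its maximum, one extreme event type sets the scale for all the others, and the equal weighting of health and economic harm is a modeling choice rather than a property of the data.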


Results

The analysis of the NOAA Storm Database (1950–2011) yields three main results. For population health, the top three events by total combined fatalities and injuries are “Tornado,” “Thunderstorm,” and “Flood,” with “Excessive Heat” ranking fourth. Economically, “Flood” (which here includes flash floods) leads in total damage, followed by “Hurricane/Typhoon,” “Tornado,” and “Storm Surge,” with losses reaching billions of USD. For average harm per occurrence, computed only for event types with at least 10 occurrences, “Hurricane/Typhoon” tops the list, followed by “Tsunami,” “Glaze,” and “Excessive Heat.” Together, these findings, all based on standardized event types, underscore the dual nature of weather risks: frequent events cause large cumulative harm, while rarer events can deliver catastrophic damage per occurrence.