Synopsis

This report presents an exploratory analysis of the NOAA Storm Events Database from 1950 to 2011 to evaluate the health and economic impact of extreme weather events in the United States. The study identifies which event types cause the highest number of fatalities and which result in the greatest economic losses. Tornadoes are the leading cause of weather-related fatalities, while floods account for the most substantial economic damage. Most event types show consistent reporting; however, issues such as classification inconsistencies, reporting bias and incomplete early records necessitate validation with secondary data sources prior to deeper analysis.

Data Processing

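The analysis uses NOAA’s Storm Events Database. The events in the database start in the year 1950 and end in November 2011. The raw data was loaded into a data.table for analysis and graphing. Data transformations were minimal: the event type was stored as a factor, damage amounts were converted to dollar figures using their magnitude codes, and rows with missing values in the analysed columns were dropped.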

The annotated data import R code is here for grading.
# Ensure R.utils is installed for fread bz2 support
if (!requireNamespace("R.utils", quietly = TRUE)) {
  install.packages("R.utils")
}

# data.table provides fread() and the [, := ] syntax used below
library(data.table)

# download file as directed in assignment
download.file(
  url ="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
  destfile = "StormData.csv.bz2",
  mode = "wb")

# read the compressed CSV straight into a data.table
storm_data <- fread("StormData.csv.bz2")

# store Event type as a factor instead of a character string
storm_data[, (c("EVTYPE")) := lapply(.SD, as.factor), .SDcols = c("EVTYPE")]

####################################################
# alphabetical characters are used to signify magnitude and
# should be converted to actual dollar amounts:
# "K" for thousands, "M" for millions, and "B" for billions
# (any other code, including blanks, is treated as a multiplier of 1)
actual_dollar_amount <- function(DMG, DMGEXP) {
  multipliers <- c("K" = 1e3, "M" = 1e6, "B" = 1e9)
  # toupper() so that lowercase codes such as "k" or "m" are also recognised
  multiplier <- multipliers[toupper(as.character(DMGEXP))]
  multiplier[is.na(multiplier)] <- 1
  return(unname(DMG * multiplier))
}

# standardise PROPDMG and CROPDMG columns to use single dollar amounts
storm_data[,PROPDMG:=actual_dollar_amount(PROPDMG,PROPDMGEXP)][,PROPDMGEXP:=NULL]
storm_data[,CROPDMG:=actual_dollar_amount(CROPDMG,CROPDMGEXP)][,CROPDMGEXP:=NULL]

# drop missing data (there aren't any)
storm_data<-storm_data[complete.cases(
  storm_data$EVTYPE,
  storm_data$FATALITIES,
  storm_data$INJURIES,
  storm_data$PROPDMG,
  storm_data$CROPDMG
), ]
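
The magnitude-code conversion can be checked on a few hand-made values. The sketch below is illustrative only: the helper name `to_dollars` and the sample inputs are hypothetical, and it is a simplified stand-in for the lookup approach used above. It assumes that any code other than K, M or B (here a blank) leaves the amount unscaled.

```r
# Hypothetical helper: a simplified stand-in for the lookup-based conversion
to_dollars <- function(dmg, dmg_exp) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  # look up a multiplier for each code; unrecognised codes come back as NA
  mult <- multipliers[as.character(dmg_exp)]
  # assumption: unrecognised codes mean "no scaling"
  mult[is.na(mult)] <- 1
  unname(dmg * mult)
}

# made-up inputs, not taken from the NOAA file
example_dmg <- c(25, 2.5, 1.2, 10)
example_exp <- c("K", "M", "B", "")

example_dollars <- to_dollars(example_dmg, example_exp)
print(example_dollars)
```

Treating unknown codes as a multiplier of 1 is a conservative choice: it can understate damage for mis-keyed entries but never inflates it.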

Methodology

A cursory appraisal of the data was first conducted to gain an initial sense of its structure and content. Based on this, a number of objectives were pre-registered before performing a more detailed exploratory analysis. The resulting findings were subsequently compared with publicly available online information.

NOAA Database Considerations

A few significant data integrity issues with the NOAA database were noted and left unaddressed during this short study. The classification of weather events was inconsistent and the classification labels contained obvious typos, aggregated multi-categories, specialist splinter categories, and other reporting anomalies. For instance, all the following categories seem to correspond to a tornado event type:

##  [1] "TORNADO"                    "TORNADO DEBRIS"            
##  [3] "TORNADO F0"                 "TORNADO F1"                
##  [5] "TORNADO F2"                 "TORNADO F3"                
##  [7] "TORNADO/WATERSPOUT"         "TORNADOES"                 
##  [9] "TORNADOES, TSTM WIND, HAIL" "TORNADOS"                  
## [11] "TORNDAO"

These issues are not suspected to have had a major impact on this study; however, a more thorough data munging exercise might be considered prior to any deeper analysis of the NOAA database. Additionally, the earlier years of the database generally contain fewer recorded events. This is likely due to less complete record keeping in those years, with more recent years considered more complete.
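
As a rough illustration of what such a munging step might look like, the sketch below collapses the tornado-like labels listed above into a single category. The matching pattern, and the decision to fold combined labels such as "TORNADOES, TSTM WIND, HAIL" into "TORNADO", are assumptions made for illustration; no such step was applied in this study.

```r
# the eleven tornado-like labels listed above
tornado_labels <- c(
  "TORNADO", "TORNADO DEBRIS", "TORNADO F0", "TORNADO F1", "TORNADO F2",
  "TORNADO F3", "TORNADO/WATERSPOUT", "TORNADOES",
  "TORNADOES, TSTM WIND, HAIL", "TORNADOS", "TORNDAO")

# two unrelated labels added for contrast
other_labels <- c("FLOOD", "HAIL")

evtype <- c(tornado_labels, other_labels)

# assumed rule: any label starting with "TORN" is treated as a tornado
clean_evtype <- ifelse(grepl("^TORN", evtype), "TORNADO", evtype)

# count how many raw labels fall into each cleaned category
print(table(clean_evtype))
```

A prefix rule like this is deliberately crude; a fuller clean-up would probably need a hand-checked mapping table, since multi-hazard labels do not belong cleanly to one category.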

While analysing the structure of the available data, we observed a high (unavoidable) potential for reporting bias. Classification ambiguity seemed unavoidable, as some weather phenomena exist on a continuum—for example, lightning and waterspouts may coincide with tornado events but are not always consistently categorised when categories overlap. It is also possible that severity or casualty counts influence classification decisions. Additionally, an “Excessive Heat” event is recorded when heat index values exceed local thresholds, which likely vary across regions and time, introducing further inconsistencies. Similarly, we found underestimation in reported fatalities and property damage costs, or at the very least inconsistencies, when compared with online materials.

Nevertheless, we feel this data is extremely high quality, and this study successfully gleans insights from the NOAA data. This short study could benefit from incorporating data from alternative sources to improve reliability and address these limitations.

Results

Weather Event Impact on Population Health

Extreme U.S. Weather Events: Injuries and Fatalities R Code is here for grading.
# 1. prep a data.table with total injuries, total fatalities for each event type
plot_data<-storm_data[
    # 1.1. calculate total fatalities and total injuries for each event type
    #      (note here we are simply aggregating events with no filtering)
    ,.(total_fatalities=sum(FATALITIES), total_injuries=sum(INJURIES)),
    by=.(EVTYPE)
  ][
    # 1.2. also remove event types with no fatalities or no injuries
    #      (this is simply for log scaling axes - also removes any data points
    #       lying on the x and y axis)
    total_fatalities > 0 & total_injuries > 0
  ]

# 2. prepare a second data table which cherry-picks some eye-catching events
#    (The purpose of these labels is to show the data points correspond to 
#     different event types - making the graph more intuitive and 
#     more immediately understandable)
label_data <- plot_data[
    # Pick out some extreme events...
    EVTYPE %in% c(
      "TORNADO", "EXCESSIVE HEAT", "FLOOD", "WILDFIRES", "BLIZZARD",
      "VOLCANIC ERUPTION", "TSUNAMI", "SNOW", "MUDSLIDE", "AVALANCHE")
  ][
    # To label these events I'll need the strings NOT the factor levels 
    , EVTYPE := as.character(EVTYPE)
  ]

# 3. plot injuries vs. fatalities using (1) and add the labels picked in (2)
library(ggplot2)
p1 <- ggplot(
  plot_data,
  aes(x = total_fatalities, y = total_injuries)) +
  geom_point(alpha = 0.3, color="steelblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Extreme U.S. Weather Events: Injuries and Fatalities",
       subtitle = paste0("Total event type fatalities vs. ",
                  "total event type injuries"),
       x = "Fatalities (log scale)",
       y = "Injuries (log scale)") +
  # let's not go nuts and output a formula here - visual is fine
  geom_smooth(method = "lm", se = TRUE, color = "coral")+
  geom_text(
    data = label_data,
    aes(label = EVTYPE),
    vjust = -0.5,
    hjust = 0.5,
    size = 3,
    color = "steelblue")+
  theme_minimal()

Extreme U.S. Weather Events: Fatalities R Code is here for grading.
# 1. look at total deaths for event types with over 100 fatalities
tot_fatal_by_event <- storm_data[
    # 1.1. aggregate fatalities for each event type
    , .(TotalFatalities = sum(FATALITIES)), by = EVTYPE
  ][
    # 1.2. remove event types with fewer than 100 fatalities
    #   ..I just don't want too many less serious  event types saturating the
    #      labels axis - around 20 event types is plenty
    TotalFatalities > 100
  ]

# 2. plot total deaths for these event types
#   and I'll sort them in descending order for quick visual ranking
p2<- ggplot(tot_fatal_by_event,
            # here's the ranking...
            aes(x = reorder(EVTYPE, -TotalFatalities),
                y = TotalFatalities)) +
  geom_bar(stat = "identity", fill = "tomato") +
  labs(title = "Extreme U.S. Weather Events: Fatalities",
       subtitle = paste0("Weather events ranked by total fatalities",
                         " (for all event types with over 100 recorded fatalities)"),
       caption = "Source: NOAA’s Storm Events Database",
       x = "", y = "Total Fatalities") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The total recorded fatalities provide a reasonable measure of the impact of weather events on population health. The number of reported fatalities in the NOAA storm database includes indirect fatalities as well as direct fatalities - this suits our purposes as the combined figure gives a comprehensive measure of health impact.

While injury data is also available, we ignore it, supposing that some events involving no fatalities may go unreported. The total number of injuries in each event type is positively correlated with the total number of fatalities in that event type; this result is shown in the adjacent figure, which uses data for all weather event types in which both fatalities and non-fatal injuries occurred.
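
The strength of such a relationship can also be summarised as a correlation on the log scale, matching the log-log axes of the figure. The sketch below is only a minimal illustration: the per-type totals are invented and are not taken from the NOAA data.

```r
# invented per-event-type totals, for illustration only
toy_totals <- data.frame(
  total_fatalities = c(1, 5, 20, 100, 1000),
  total_injuries   = c(3, 40, 150, 900, 20000))

# Pearson correlation between the log10-transformed totals
log_cor <- cor(log10(toy_totals$total_fatalities),
               log10(toy_totals$total_injuries))
print(log_cor)
```

Working on the log scale keeps a handful of very large event types from dominating the estimate, which is the same reason the figure uses log axes.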

Tornadoes are the cause of the largest number of reported fatalities in the NOAA database. While they contribute the highest total fatalities, they are often not the most fatal per event, as many tornado events result in no deaths. Other, less common event types show lower variance and are more deadly on average.
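
The distinction between total and per-event fatalities can be sketched with a small made-up table. The event names and counts below are invented purely for illustration and do not come from the NOAA data; base R is used so the sketch runs without extra packages.

```r
# invented toy records: one row per event instance
toy_events <- data.frame(
  EVTYPE     = c("A", "A", "A", "A", "A", "B", "B"),
  FATALITIES = c(0, 0, 0, 0, 10, 4, 4))

# total, mean and variance of fatalities for each event type
totals <- tapply(toy_events$FATALITIES, toy_events$EVTYPE, sum)
means  <- tapply(toy_events$FATALITIES, toy_events$EVTYPE, mean)
vars   <- tapply(toy_events$FATALITIES, toy_events$EVTYPE, var)

summary_table <- data.frame(
  EVTYPE   = names(totals),
  total    = as.numeric(totals),
  mean     = as.numeric(means),
  variance = as.numeric(vars))
print(summary_table)
```

In this toy table, type A has the larger total because it occurs more often, while type B is deadlier per event and less variable, which is the pattern described above.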

The deadliest single recorded event corresponds to the 1995 Chicago heatwave. No other heat-related event comes close to this level of fatalities. The case represents an extreme outlier which skews the totals in this event category. This was an exceptionally hot and humid heatwave which resulted in 583 reported deaths, a figure that is potentially an under-count. Many reported deaths are attributed to mortality displacement: “the occurrence of deaths at an earlier time than they would have otherwise occurred, meaning the deaths are displaced from the future into the present” (Wikipedia contributors, 2025). The high number of deaths has also been attributed to the lack of preparedness and deprivation in the affected area (National Climatic Data Center (NCDC), now NCEI, 2021). This event highlights the multivariate nature of fatalities: the “deadliness” of each event depends on its context, e.g. a storm over a populous area as opposed to a remote or rural area.

Weather Event Impact on Economics

Extreme U.S. Weather Events: Total Event Type Damage R Code is here for grading.
# 1. get total financial damages for each event type
augm_storm_data<-storm_data[,
           .(`Property Damage`=sum(PROPDMG),
             `Crop Damage`=sum(CROPDMG))
           ,by=.(EVTYPE)]

# 2. weirdly - i'm going to add a redundant overall cost column too
#     (I need it to order the individual costs in terms of total cost
#     ...sometimes to make an omelette you've got to crack some eggs)
augm_storm_data[,`total cost`:=`Property Damage`+`Crop Damage`]

# 3. I'm gonna drop any event types that don't make the $1B threshold
augm_storm_data<-augm_storm_data[`Property Damage`+`Crop Damage`>1e9]

# 4. Okay... melting. Double entries for each event - one for each CostType
long_costs <- melt(augm_storm_data,
                   id.vars = c("EVTYPE","total cost"),
                   measure.vars = c("Property Damage", "Crop Damage"),
                   variable.name = "CostType",
                   value.name = "Cost")

# 5. Plotting a stacked bar graph...
library(ggplot2)
p1 <- ggplot(long_costs,
             aes(x = reorder(EVTYPE,-`total cost`),    # sort using total cost
                 y = Cost/1e9,                         # work in Billion USD
                 fill = CostType)) +                   # colour crops and props
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Extreme U.S. Weather Events: Total Event Type Damage",
       subtitle = paste0("event types ranked by total cost",
                         " (showing all recorded event types exceeding $1B costs)"),
       caption = "Source: NOAA’s Storm Events Database",
       x = "", y = "Cost (Billion USD)", fill = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        # ggplot2 >= 3.5.0: place the legend inside the panel, top-right
        legend.position = "inside",
        legend.position.inside = c(1, 1),
        legend.justification = c("right", "top"))

The economic impact of weather events can be quantified through the total damage costs (both private and public). This data is generally available in the NOAA events database through a mixture of actual and estimated costs. We have retained the cost breakdown between property and crop damage for interest - crop damage is very small in comparison to property damage but not negligible. The flood event type is responsible for significantly more costs than any other event type.

The 2005 Napa floods are recorded as the costliest single event, with $115B of property damage and $32.5M of damage to crops. However, this figure appears to be inconsistent with the event’s narrative description and other independent estimates, suggesting a likely data entry or unit error.

The accompanying NOAA narrative makes no mention of damage on the scale of billions:

“Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.”

Similarly, a third-party source (Los Angeles County Flood of 2005, 2025) reports:

“In total, the entire Atmospheric River event caused between $200-$300 million in damages.”

In contrast, the second-highest entry in the database is Hurricane Katrina (2005), which is widely recognised as the costliest U.S. weather disaster in history (Insurance Information Institute, 2024; Knabb et al., 2006). It is listed at $31.3B in property damage in the NOAA database.
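
The scale of the suspected discrepancy in the Napa entry can be made concrete with simple arithmetic. The sketch below rescales the recorded figure of 115 under each magnitude code and compares it with the $200-$300 million range quoted above. That the entry should have carried a code other than "B" is a hypothesis only, not something the data confirm.

```r
# damage figure as entered, in units of its magnitude code
recorded_value <- 115
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# dollar amount implied by each possible magnitude code
implied <- recorded_value * multipliers
print(implied)

# range reported by the third-party source quoted above
independent_range <- c(200e6, 300e6)

# ratio of each implied amount to the midpoint of that range
ratio_to_midpoint <- implied / mean(independent_range)
print(round(ratio_to_midpoint, 2))
```

Only the "M" reading lands within an order of magnitude of the independent estimate, which is why a unit error seems a plausible explanation.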

References

Insurance Information Institute. (2024). Spotlight on catastrophes: Insurance issues. https://www.iii.org/article/spotlight-on-catastrophes-insurance-issues
Knabb, R. D., Rhome, J. R., & Brown, D. P. (2006). Tropical cyclone report: Hurricane Katrina, 23–30 August 2005 (AL122005). National Hurricane Center. https://www.nhc.noaa.gov/data/tcr/AL122005_Katrina.pdf
Los Angeles County flood of 2005. (2025). Wikipedia. https://en.wikipedia.org/wiki/Los_Angeles_County_flood_of_2005
National Climatic Data Center (NCDC), now NCEI. (2021). Climate history: July 1995 Chicago-area heat wave. NOAA/National Centers for Environmental Information news release. https://web.archive.org/web/20211130151638/https://www.ncdc.noaa.gov/news/climate-history-july-1995-chicago-area-heat-wave
Wikipedia contributors. (2025). 1995 Chicago heat wave. https://en.wikipedia.org/wiki/1995_Chicago_heat_wave