Analysis of NOAA Storm Database

Synopsis:

This report analyzes the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database for the year 2025 to identify the most severe weather events and their impacts. The goal is to provide actionable insights for government and municipal managers responsible for disaster preparedness and resource allocation.

The analysis examines four key dimensions: population health impact, geographic distribution, seasonal patterns, and economic damage. The results show that Flash Flood events are the most harmful to population health, accounting for the highest number of combined injuries and fatalities by a large margin.

Geographically, Texas records the highest frequency of severe weather events, led by Flash Floods, Hail, and Thunderstorm Winds. Seasonally, cold-related events are most prominent in the winter months, while spring and summer months see increased activity from hail, thunderstorm winds, and flash flooding.

Finally, Flash Floods are also the most economically damaging events, causing far higher property losses than any other event type. These findings highlight the importance of prioritizing flood mitigation strategies, seasonal preparedness, and region-specific resource planning.

Data Processing:

To begin the analysis, the raw CSV files containing the 2025 storm details, locations, and fatalities were loaded into the R environment using the readr package. These datasets were imported as data frames to enable efficient data manipulation.

Since the relevant information was distributed across multiple files, the datasets were merged into a single dataset using the left_join() function from the dplyr package, based on the common EVENT_ID variable. This combined dataset includes event characteristics, geographic information, and fatality records. Because one event can have several location and fatality records, the joined table can hold more than one row per EVENT_ID.

Following the data integration, several transformations were performed to support the analysis. To measure the overall impact on population health, total injuries and deaths were calculated by combining both direct and indirect values, and a new variable total_harmed was created. Missing values were excluded from the sums by passing na.rm = TRUE to sum(), so that a single missing entry does not turn a whole group total into NA.

The dataset was then grouped and summarized by key variables such as EVENT_TYPE, STATE, and MONTH_NAME to identify patterns across different dimensions. To improve clarity and visualization, subsets of the data were selected, including the top 10 most harmful events, the top 5 states with the highest event frequency, and the top 3 most common events per month.

For the economic impact analysis, a custom function was developed to convert alphanumeric property damage values (e.g., K, M, B) into numeric format. This allowed for accurate calculation and comparison of total economic damage across event types.

# 1. Loading necessary libraries
# These libraries are used for data manipulation (dplyr), reading CSV files (readr), creating visualizations (ggplot2), and ordering bars within plot facets (tidytext: reorder_within(), scale_x_reordered()).
library(dplyr)
library(readr)
library(ggplot2)
library(tidytext)

# 2. Defining the folder path
# This specifies the directory where the NOAA data files are stored.
folder_path <- '/Users/sabreenaaleemnabeela/Desktop/Main Folder/Data Stewardship/Final Project'

# 3. Defining File Paths
# These lines create full file paths for each dataset using the base folder.
details_file <- file.path(folder_path, "StormEvents_details-ftp_v1.0_d2025_c20260323.csv")
fatalities_file <- file.path(folder_path, "StormEvents_fatalities-ftp_v1.0_d2025_c20260323.csv")
locations_file <- file.path(folder_path, "StormEvents_locations-ftp_v1.0_d2025_c20260323.csv")

# 4. Loading the Raw Data
# The read_csv() function loads each CSV file into R as a data frame.
details <- read_csv(details_file)
fatalities <- read_csv(fatalities_file)
locations <- read_csv(locations_file)

# 5. Joining the datasets by EVENT_ID
# The datasets are merged into one using the common EVENT_ID variable.
# This creates a comprehensive dataset with all relevant information.
joined_data <- details %>%
  left_join(locations, by = "EVENT_ID") %>%
  left_join(fatalities, by = "EVENT_ID")

# 6. Saving the joined dataset
# The combined dataset is saved as a new CSV file for future use.
output_file <- file.path(folder_path, "StormEvents_joined_data.csv")
write_csv(joined_data, output_file)
message("Joined data saved to: ", output_file)

# 7. Previewing the Data
# Displays the first few rows to verify that the data has been loaded and joined correctly.
print(head(joined_data))

## # A tibble: 6 × 70
##   BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME
##             <dbl>     <dbl>      <dbl>         <dbl>   <dbl>    <dbl>
## 1          202503        31       1104        202503      31     1106
## 2          202503        30       1552        202503      30     1555
## 3          202501         5       1800        202501       6     2227
## 4          202501         3       1300        202501       3     1900
## 5          202501         3       1300        202501       3     1900
## 6          202501         3       1300        202501       3     1900
## # ℹ 64 more variables: EPISODE_ID.x <dbl>, EVENT_ID <dbl>, STATE <chr>,
## #   STATE_FIPS <dbl>, YEAR <dbl>, MONTH_NAME <chr>, EVENT_TYPE <chr>,
## #   CZ_TYPE <chr>, CZ_FIPS <dbl>, CZ_NAME <chr>, WFO <chr>,
## #   BEGIN_DATE_TIME <chr>, CZ_TIMEZONE <chr>, END_DATE_TIME <chr>,
## #   INJURIES_DIRECT <dbl>, INJURIES_INDIRECT <dbl>, DEATHS_DIRECT <dbl>,
## #   DEATHS_INDIRECT <dbl>, DAMAGE_PROPERTY <chr>, DAMAGE_CROPS <chr>,
## #   SOURCE <chr>, MAGNITUDE <dbl>, MAGNITUDE_TYPE <chr>, FLOOD_CAUSE <chr>, …
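
The two left joins in step 5 keep every event, but they also repeat an event's row once for each matching location and fatality record. The following is a minimal sketch of that effect, not part of the original analysis. The three toy tables and their values are hypothetical stand-ins for the NOAA files, and distinct() is shown only as one possible way to return to one row per event before summing event-level columns.

```r
library(dplyr)

# Hypothetical toy tables standing in for the details, locations, and fatalities files
details <- tibble(
  EVENT_ID      = c(1, 2),
  EVENT_TYPE    = c("Flash Flood", "Hail"),
  DEATHS_DIRECT = c(2, 0)
)
locations  <- tibble(EVENT_ID = c(1, 1, 2), LOCATION_INDEX = c(1, 2, 1))
fatalities <- tibble(EVENT_ID = c(1, 1), FATALITY_ID = c(101, 102))

# Same join pattern as step 5 (recent dplyr versions may warn about a
# many-to-many relationship in the second join)
joined <- details %>%
  left_join(locations, by = "EVENT_ID") %>%
  left_join(fatalities, by = "EVENT_ID")

# Event 1 has 2 locations x 2 fatality records, so it occupies 4 rows
rows_per_event <- count(joined, EVENT_ID)

# Summing an event-level column over the joined rows repeats its value
inflated_deaths <- sum(joined$DEATHS_DIRECT, na.rm = TRUE)

# One row per event restores the event-level total
per_event <- distinct(joined, EVENT_ID, .keep_all = TRUE)
event_level_deaths <- sum(per_event$DEATHS_DIRECT, na.rm = TRUE)

print(rows_per_event)
print(c(inflated = inflated_deaths, event_level = event_level_deaths))
```

In this toy case the joined table has 5 rows for 2 events, and the 2 deaths of event 1 are counted four times. Whether the same correction is wanted for the real data depends on the question being asked, since location-level and fatality-level columns do need the extra rows.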

Results:

Q1: Most harmful events with respect to population health

To determine which events are most harmful to population health, the direct and indirect injuries and deaths for each weather event type were aggregated to create a “total harmed” metric.

## Aggregating direct and indirect injuries and deaths
health_impact <- joined_data %>%
  group_by(EVENT_TYPE) %>%
  summarise(
    total_injuries = sum(INJURIES_DIRECT, na.rm = TRUE) + sum(INJURIES_INDIRECT, na.rm = TRUE),
    total_deaths = sum(DEATHS_DIRECT, na.rm = TRUE) + sum(DEATHS_INDIRECT, na.rm = TRUE),
    total_harmed = total_injuries + total_deaths,
    .groups = "drop"
  ) %>%
  arrange(desc(total_harmed))

head(health_impact, 10)

## # A tibble: 10 × 4
##    EVENT_TYPE        total_injuries total_deaths total_harmed
##    <chr>                      <dbl>        <dbl>        <dbl>
##  1 Flash Flood                   81        37430        37511
##  2 Wildfire                     419          873         1292
##  3 Tornado                      688          441         1129
##  4 Excessive Heat               326          444          770
##  5 Dust Storm                   417           91          508
##  6 Heat                          51          429          480
##  7 Thunderstorm Wind            158          105          263
##  8 Winter Weather               131           47          178
##  9 Lightning                    104           27          131
## 10 Winter Storm                  33           65           98

## Isolating the top 10 for the plot
top10_health <- health_impact %>% slice_max(total_harmed, n = 10)

## Plot 1: Health Impact
ggplot(top10_health, aes(x = reorder(EVENT_TYPE, total_harmed), y = total_harmed)) +
  geom_bar(stat = "identity", fill = "tomato") +
  coord_flip() +
  labs(
    title = "Top 10 Most Harmful Event Types (Health Impact)",
    x = "Event Type",
    y = "Total Harmed (Injuries + Deaths)"
  ) +
  theme_minimal()

Figure 1: A horizontal bar chart displaying the top 10 weather event types by combined number of injuries and fatalities (direct and indirect) in 2025. This highlights which events pose the greatest threat to human life.

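Flash Flood events are by far the most harmful to population health in the United States, with a total of 37,511 combined injuries and deaths, almost all of them deaths (37,430). This is roughly 29 times the total for the next event type, Wildfire (1,292), indicating that flash floods pose the greatest threat to human life. Because these sums run over the joined table, where an event's counts repeat for each matching location and fatality row, the absolute figures overstate per-event totals.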

Other events such as wildfires (1,292) and tornadoes (1,129) also contribute to health impacts, but their totals are far smaller.

This suggests that emergency preparedness efforts should prioritize early warning systems, evacuation planning, and infrastructure improvements specifically for flash flood events.

Q2: Event Frequency by State

The five states with the highest total number of severe weather event records in 2025 were identified first. The per-state counts were then filtered to these states to show the five most frequent event types within each of them.

## Counting the frequency of events by state
event_counts <- joined_data %>%
  group_by(STATE, EVENT_TYPE) %>%
  summarise(total_events = n(), .groups = "drop") %>%
  arrange(desc(total_events))

head(event_counts, 10)

## # A tibble: 10 × 3
##    STATE         EVENT_TYPE        total_events
##    <chr>         <chr>                    <int>
##  1 TEXAS         Flash Flood               3337
##  2 VIRGINIA      Flash Flood               2140
##  3 ALABAMA       Thunderstorm Wind         1533
##  4 TEXAS         Hail                      1520
##  5 CALIFORNIA    Flood                     1384
##  6 PENNSYLVANIA  Flash Flood               1379
##  7 TEXAS         Thunderstorm Wind         1243
##  8 VIRGINIA      Thunderstorm Wind         1115
##  9 ARIZONA       Flash Flood               1076
## 10 WEST VIRGINIA Flash Flood               1055

## Finding the top 5 states with the most overall events
## (slice_max() already orders by n, so no separate arrange() is needed)
top_states <- joined_data %>%
  count(STATE) %>%
  slice_max(n, n = 5) %>%
  pull(STATE)

## Filtering to the top 5 states, AND finding the top 5 events WITHIN each state
top_state_events <- event_counts %>%
  filter(STATE %in% top_states) %>%
  group_by(STATE) %>%
  slice_max(total_events, n = 5) %>%
  ungroup()

## Plot 2: State Frequency
ggplot(top_state_events, aes(x = reorder_within(EVENT_TYPE, total_events, STATE), y = total_events, fill = STATE)) +
  geom_col(show.legend = FALSE) +
  # Adding the numeric annotations to the end of each bar
  geom_text(aes(label = scales::comma(total_events)), hjust = -0.15, size = 2, color = "black") +
  # Cleaning up the axis labels and expanding the scale so annotations don't get cut off
  scale_x_reordered() +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.35))) +
  # No ncol constraint, so the 5 state panels wrap naturally into a grid
  facet_wrap(~ STATE, scales = "free_y") +
  coord_flip() +
  labs(
    title = "Top 5 Weather Events in the 5 Most Impacted States",
    x = "Event Type",
    y = "Number of Events"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 14),
    # Strip labels: bold, size 8, with top and bottom padding (margin) to make the grey box taller
    strip.text = element_text(face = "bold", size = 8, margin = margin(t = 6, r = 0, b = 6, l = 0)),
    strip.background = element_rect(fill = "gray90", color = NA),
    panel.grid.major.y = element_blank(),
    # Angles the numbers on the bottom so they don't overlap
    axis.text.x = element_text(angle = 45, hjust = 1) 
  )

Figure 2: A faceted horizontal bar chart showing the top 5 most frequent weather event types within each of the five states that recorded the highest total number of weather events in 2025.

The analysis shows that Texas experiences the highest frequency of severe weather events, particularly Flash Floods (3,337), Hail (1,520), and Thunderstorm Winds (1,243).

Flash Floods are consistently among the most common event types across multiple states, including Texas, Virginia, Pennsylvania, and West Virginia.

This pattern indicates that certain regions are more prone to specific types of weather events, and resource allocation should be tailored accordingly. For example, flood prevention and water management strategies are especially important in these high-risk states.
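
The frequencies above are computed with n(), which counts rows of the joined table. A minimal sketch of how row counts and distinct event counts can differ is shown below. The toy rows are hypothetical, and n_distinct(EVENT_ID) is offered as an assumption about what "number of events" could mean, not as the method used in this report.

```r
library(dplyr)

# Hypothetical joined rows: event 1 appears three times after the joins
toy_joined <- tibble(
  EVENT_ID   = c(1, 1, 1, 2, 3),
  STATE      = c("TEXAS", "TEXAS", "TEXAS", "TEXAS", "VIRGINIA"),
  EVENT_TYPE = c("Flash Flood", "Flash Flood", "Flash Flood", "Hail", "Flash Flood")
)

# Compare the number of joined rows with the number of distinct events
counts <- toy_joined %>%
  group_by(STATE, EVENT_TYPE) %>%
  summarise(
    joined_rows     = n(),
    distinct_events = n_distinct(EVENT_ID),
    .groups = "drop"
  ) %>%
  arrange(desc(joined_rows))

print(counts)
```

Here the Texas Flash Flood group has three joined rows but only one distinct event, so the two columns can rank states and event types differently when some events carry many location or fatality records.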

Q3: Event characteristics by month

To understand seasonal patterns, the dataset was grouped by month and event type. The five most frequent event types for each calendar month are listed in the table, and the top three per month are plotted to maintain visual readability.

## Counting events by month and keeping the top 5 event types per month
monthly_counts <- joined_data %>%
  group_by(MONTH_NAME, EVENT_TYPE) %>%
  summarise(total_events = n(), .groups = "drop") %>%
  group_by(MONTH_NAME) %>%
  slice_max(total_events, n = 5) %>%
  ungroup() %>%
  arrange(match(MONTH_NAME, month.name), desc(total_events))

print(monthly_counts, n = 60)

## # A tibble: 60 × 3
##    MONTH_NAME EVENT_TYPE              total_events
##    <chr>      <chr>                          <int>
##  1 January    Winter Storm                    1122
##  2 January    Winter Weather                  1007
##  3 January    Extreme Cold/Wind Chill          816
##  4 January    Cold/Wind Chill                  643
##  5 January    Heavy Snow                       602
##  6 February   Flood                           1382
##  7 February   Winter Weather                  1372
##  8 February   Extreme Cold/Wind Chill          867
##  9 February   Flash Flood                      772
## 10 February   Winter Storm                     698
## 11 March      Thunderstorm Wind               2401
## 12 March      High Wind                       1670
## 13 March      Hail                            1087
## 14 March      Strong Wind                      342
## 15 March      Tornado                          327
## 16 April      Thunderstorm Wind               2801
## 17 April      Flash Flood                     1926
## 18 April      Hail                            1840
## 19 April      Flood                            971
## 20 April      Tornado                          588
## 21 May        Thunderstorm Wind               3969
## 22 May        Hail                            2858
## 23 May        Flash Flood                     2165
## 24 May        Flood                            809
## 25 May        Tornado                          566
## 26 June       Thunderstorm Wind               5416
## 27 June       Flash Flood                     3783
## 28 June       Hail                            1426
## 29 June       Flood                            812
## 30 June       Heat                             563
## 31 July       Flash Flood                     6608
## 32 July       Thunderstorm Wind               3855
## 33 July       Heat                            1423
## 34 July       Excessive Heat                   817
## 35 July       Flood                            810
## 36 August     Flash Flood                     2230
## 37 August     Thunderstorm Wind               1682
## 38 August     Heat                             906
## 39 August     Flood                            621
## 40 August     Hail                             500
## 41 September  Flash Flood                     1379
## 42 September  Thunderstorm Wind                807
## 43 September  Hail                             552
## 44 September  Drought                          402
## 45 September  Flood                            280
## 46 October    Flash Flood                      628
## 47 October    Drought                          432
## 48 October    Flood                            233
## 49 October    Thunderstorm Wind                227
## 50 October    Coastal Flood                    186
## 51 November   Drought                          425
## 52 November   Winter Storm                     357
## 53 November   Winter Weather                   336
## 54 November   Flood                            304
## 55 November   High Wind                        240
## 56 December   Winter Weather                  1276
## 57 December   High Wind                       1061
## 58 December   Flood                            561
## 59 December   Winter Storm                     475
## 60 December   Drought                          325

## Filtering to the top 3 events per month for readability
top_monthly_events <- monthly_counts %>%
  group_by(MONTH_NAME) %>%
  slice_max(total_events, n = 3) %>%
  ungroup()

## Ensuring chronological order for the facet wrap
top_monthly_events$MONTH_NAME <- factor(top_monthly_events$MONTH_NAME, levels = month.name)

## Plot 3: Monthly Trends
ggplot(top_monthly_events, aes(x = reorder(EVENT_TYPE, total_events), y = total_events, fill = MONTH_NAME)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ MONTH_NAME, scales = "free_y", ncol = 3) +
  coord_flip() +
  labs(
    title = "Top 3 Most Common Event Types by Month",
    x = "Event Type",
    y = "Number of Events"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.y = element_text(size = 7))

Figure 3: A faceted grid chart illustrating the seasonality of severe weather, displaying the top three most frequent event types for each month of the year chronologically.

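The results reveal strong seasonal patterns in severe weather events. January is dominated by cold-related events, led by Winter Storm, Winter Weather, and Extreme Cold/Wind Chill. February is a partial exception: Flood (1,382) is narrowly ahead of Winter Weather (1,372), with Extreme Cold/Wind Chill third.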

In contrast, spring and summer months (April through August) experience a higher frequency of Thunderstorm Wind and Flash Flood events. Hail is among the top three from April through June, while Heat takes its place in July and August.

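Peak activity occurs in early summer, particularly June and July, when convective storms are most frequent: Thunderstorm Wind peaks in June (5,416) and Flash Flood in July (6,608). By August the leading counts fall to less than half of those peaks.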

These findings suggest that emergency preparedness strategies should be adjusted seasonally, with different types of risks prioritized throughout the year.
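
To locate the peak month directly rather than reading it off the per-type table, monthly totals can be computed and put in calendar order. The sketch below uses a small hypothetical table in place of joined_data; only the column names EVENT_ID and MONTH_NAME come from the dataset, and de-duplicating by EVENT_ID first is an assumption about how events should be counted.

```r
library(dplyr)

# Hypothetical events standing in for joined_data
toy_events <- tibble(
  EVENT_ID   = 1:7,
  MONTH_NAME = c("June", "July", "July", "January", "July", "June", "August")
)

monthly_totals <- toy_events %>%
  distinct(EVENT_ID, .keep_all = TRUE) %>%                       # one row per event
  count(MONTH_NAME, name = "total_events") %>%
  mutate(MONTH_NAME = factor(MONTH_NAME, levels = month.name)) %>%
  arrange(MONTH_NAME)                                            # calendar order, not alphabetical

# Month with the largest total (first one if there is a tie)
peak_month <- monthly_totals %>%
  slice_max(total_events, n = 1, with_ties = FALSE)

print(monthly_totals)
print(peak_month)
```

Converting MONTH_NAME to a factor with levels = month.name is the same device used for the facet order in Plot 3; without it, arrange() would sort the months alphabetically.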

Q4: Which events cause the most severe economic impact?

Data Cleaning: The raw data uses letters (K, M, B) to represent thousands, millions, and billions. The custom convert_damage function strips out these letters and applies the correct mathematical multiplier so R can calculate actual dollar amounts.
Data Aggregation: Using dplyr, the code groups the newly cleaned data by EVENT_TYPE. It then uses summarise to add up the total property damage for each weather category and sorts them from highest to lowest.
Visualization: The slice_max command isolates just the top 10 most expensive events. Finally, ggplot builds a bar chart, using coord_flip() to turn it horizontal so the long weather event names are easy to read.

## Custom function to convert alphanumeric damage abbreviations (K, M, B) to actual numbers
convert_damage <- function(x) {
  x <- toupper(x)
  as.numeric(gsub("[KMB]", "", x)) *
    ifelse(grepl("K", x), 1e3,
           ifelse(grepl("M", x), 1e6,
                  ifelse(grepl("B", x), 1e9, 1)))
}

## Applying the conversion to the dataset
joined_data$damage_clean <- convert_damage(joined_data$DAMAGE_PROPERTY)
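
As a quick check of the conversion, the function can be run on a few sample strings. The function body below is copied from the definition above so the block runs on its own; the sample values are hypothetical and only illustrate the K, M, B, and no-suffix cases.

```r
# Copy of the report's conversion function
convert_damage <- function(x) {
  x <- toupper(x)
  as.numeric(gsub("[KMB]", "", x)) *
    ifelse(grepl("K", x), 1e3,
           ifelse(grepl("M", x), 1e6,
                  ifelse(grepl("B", x), 1e9, 1)))
}

# Hypothetical sample values: thousands, millions, billions, lowercase suffix, no suffix
samples   <- c("10.00K", "1.5M", "2B", "0.00k", "250")
converted <- convert_damage(samples)

# Show each input next to its dollar value
print(data.frame(raw = samples, dollars = converted))
```

A value with any other suffix would become NA in as.numeric() (with a coercion warning) and would then be dropped by na.rm = TRUE in the summary step, so it may be worth checking how many NA values damage_clean contains.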

## Summarizing the clean damage data
damage_summary <- joined_data %>%
  group_by(EVENT_TYPE) %>%
  summarise(total_damage = sum(damage_clean, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_damage))

head(damage_summary, 10)

## # A tibble: 10 × 2
##    EVENT_TYPE        total_damage
##    <chr>                    <dbl>
##  1 Flash Flood        78285317100
##  2 Tornado            13264707500
##  3 Debris Flow          902091800
##  4 Wildfire             788942110
##  5 Flood                313959150
##  6 Thunderstorm Wind    260048450
##  7 Hail                  71587500
##  8 Drought               37133250
##  9 Lightning             22620150
## 10 High Wind             12212600

## Isolating the top 10 for the plot
top10_damage <- damage_summary %>% slice_max(total_damage, n = 10)

## Plot 4: Economic Impact
ggplot(top10_damage, aes(x = reorder(EVENT_TYPE, total_damage), y = total_damage)) +
  geom_bar(stat = "identity", fill = "darkblue") +
  coord_flip() +
  labs(
    title = "Top 10 Event Types by Property Damage",
    x = "Event Type",
    y = "Total Property Damage (USD)"
  ) +
  theme_minimal()

Figure 4: A horizontal bar chart detailing the top 10 weather events that resulted in the highest total property damage (in USD) across the United States in 2025.

Flash Floods are also the most economically damaging event type, causing approximately USD 78.3 billion in property damage. This is nearly six times the total for the next most damaging event type, Tornadoes, which account for about USD 13.3 billion.
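
The size of that gap can be checked directly from the totals in the table above. The sketch below re-enters the ten printed values by hand; the derived columns damage_billion and share_of_top10 are illustrative names that do not appear in the original analysis.

```r
library(dplyr)

# Totals copied from the printed damage_summary table (USD)
top10 <- tribble(
  ~EVENT_TYPE,         ~total_damage,
  "Flash Flood",        78285317100,
  "Tornado",            13264707500,
  "Debris Flow",          902091800,
  "Wildfire",             788942110,
  "Flood",                313959150,
  "Thunderstorm Wind",    260048450,
  "Hail",                  71587500,
  "Drought",               37133250,
  "Lightning",             22620150,
  "High Wind",             12212600
)

top10 <- top10 %>%
  mutate(
    damage_billion = round(total_damage / 1e9, 2),                  # readable units
    share_of_top10 = round(total_damage / sum(total_damage), 3)     # share within the top 10 only
  )

# Ratio of the largest total to the second largest
flood_to_tornado <- top10$total_damage[1] / top10$total_damage[2]

print(top10)
print(round(flood_to_tornado, 1))
```

The share is relative to the ten listed event types only, not to all property damage in the dataset.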

The large gap highlights the severe financial consequences associated with flooding events, likely due to widespread infrastructure damage and property loss.

These results emphasize the importance of investing in flood mitigation strategies, such as improved drainage systems, flood barriers, and land-use planning, to reduce future economic losses.