Synopsis

This document intends to summarize the cleaning and analysis of the 2025 NOAA Storm Events database, which takes data from severe weather events across the United States starting at the beginning of 2025 til present day. This analysis holds three main datasets (details, fatalities, and locations), and joins them together for said analysis. The merged dataset is cleaned and then edited to see trends in the data that answer questions of most harmful storm, storm density based on location, time of year for weather events, and sources used to report severe weather events.

Data Processing

To begin, all packages necessary are imported and installed.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.5.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.5.2
library(forcats)
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(stringr)
## Warning: package 'stringr' was built under R version 4.5.2

The working directory and file path is then set, and the raw data files of details, fatalities and locations are read in. They are then given names “details”, “fatalities”, and “locations”.

folder_path <- "C:/Users/Kathr/Desktop/DAT511/New folder (3)"
setwd("C:/Users/Kathr/Desktop/DAT511/New folder (3)")
getwd()
## [1] "C:/Users/Kathr/Desktop/DAT511/New folder (3)"
details_file <- file.path(folder_path, "StormEvents_details-ftp_v1.0_d2025_c20251118.csv.gz")
fatalities_file <- file.path(folder_path, "StormEvents_fatalities-ftp_v1.0_d2025_c20251118.csv.gz")
locations_file <- file.path(folder_path, "StormEvents_locations-ftp_v1.0_d2025_c20251118 (1).csv.gz")

details <- read_csv(details_file)
## Rows: 44721 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (24): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## lgl  (1): CATEGORY
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fatalities <- read_csv(fatalities_file)
## Rows: 423 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): FATALITY_TYPE, FATALITY_DATE, FATALITY_SEX, FATALITY_LOCATION
## dbl (7): FAT_YEARMONTH, FAT_DAY, FAT_TIME, FATALITY_ID, EVENT_ID, FATALITY_A...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
locations <- read_csv(locations_file)
## Rows: 31130 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): AZIMUTH, LOCATION
## dbl (9): YEARMONTH, EPISODE_ID, EVENT_ID, LOCATION_INDEX, RANGE, LATITUDE, L...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The datasets are then merged to get one useable “joined_data” dataset by the shared variable name “EVENT_ID”. A message is then printed to mention joined_data was saved, and the option of printing some of the data is provided.

joined_data <- details %>%
  left_join(locations, by = "EVENT_ID") %>%
  left_join(fatalities, by = "EVENT_ID")
## Warning in left_join(., fatalities, by = "EVENT_ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 896 of `x` matches multiple rows in `y`.
## ℹ Row 325 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
output_file <- file.path(folder_path, "StormEvents_joined_data.csv")
write_csv(joined_data, output_file)

message("Joined data saved to: ", output_file)
## Joined data saved to: C:/Users/Kathr/Desktop/DAT511/New folder (3)/StormEvents_joined_data.csv
print(head(joined_data))
## # A tibble: 6 × 71
##   BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME
##             <dbl>     <dbl>      <dbl>         <dbl>   <dbl>    <dbl>
## 1          202503        31       1104        202503      31     1106
## 2          202503        30       1552        202503      30     1555
## 3          202501         5       1800        202501       6     2227
## 4          202501         3       1300        202501       3     1900
## 5          202501         3       1300        202501       3     1900
## 6          202501         3       1300        202501       3     1900
## # ℹ 65 more variables: EPISODE_ID.x <dbl>, EVENT_ID <dbl>, STATE <chr>,
## #   STATE_FIPS <dbl>, YEAR <dbl>, MONTH_NAME <chr>, EVENT_TYPE <chr>,
## #   CZ_TYPE <chr>, CZ_FIPS <dbl>, CZ_NAME <chr>, WFO <chr>,
## #   BEGIN_DATE_TIME <chr>, CZ_TIMEZONE <chr>, END_DATE_TIME <chr>,
## #   INJURIES_DIRECT <dbl>, INJURIES_INDIRECT <dbl>, DEATHS_DIRECT <dbl>,
## #   DEATHS_INDIRECT <dbl>, DAMAGE_PROPERTY <chr>, DAMAGE_CROPS <chr>,
## #   SOURCE <chr>, MAGNITUDE <dbl>, MAGNITUDE_TYPE <chr>, FLOOD_CAUSE <chr>, …

Joined_data is then cleaned (removed redundency) and is named “data”.

joined_data_clean <- joined_data %>%
  select(-EPISODE_ID.y, -YEARMONTH, -LOCATION_INDEX)
names(joined_data_clean)
##  [1] "BEGIN_YEARMONTH"    "BEGIN_DAY"          "BEGIN_TIME"        
##  [4] "END_YEARMONTH"      "END_DAY"            "END_TIME"          
##  [7] "EPISODE_ID.x"       "EVENT_ID"           "STATE"             
## [10] "STATE_FIPS"         "YEAR"               "MONTH_NAME"        
## [13] "EVENT_TYPE"         "CZ_TYPE"            "CZ_FIPS"           
## [16] "CZ_NAME"            "WFO"                "BEGIN_DATE_TIME"   
## [19] "CZ_TIMEZONE"        "END_DATE_TIME"      "INJURIES_DIRECT"   
## [22] "INJURIES_INDIRECT"  "DEATHS_DIRECT"      "DEATHS_INDIRECT"   
## [25] "DAMAGE_PROPERTY"    "DAMAGE_CROPS"       "SOURCE"            
## [28] "MAGNITUDE"          "MAGNITUDE_TYPE"     "FLOOD_CAUSE"       
## [31] "CATEGORY"           "TOR_F_SCALE"        "TOR_LENGTH"        
## [34] "TOR_WIDTH"          "TOR_OTHER_WFO"      "TOR_OTHER_CZ_STATE"
## [37] "TOR_OTHER_CZ_FIPS"  "TOR_OTHER_CZ_NAME"  "BEGIN_RANGE"       
## [40] "BEGIN_AZIMUTH"      "BEGIN_LOCATION"     "END_RANGE"         
## [43] "END_AZIMUTH"        "END_LOCATION"       "BEGIN_LAT"         
## [46] "BEGIN_LON"          "END_LAT"            "END_LON"           
## [49] "EPISODE_NARRATIVE"  "EVENT_NARRATIVE"    "DATA_SOURCE"       
## [52] "RANGE"              "AZIMUTH"            "LOCATION"          
## [55] "LATITUDE"           "LONGITUDE"          "LAT2"              
## [58] "LON2"               "FAT_YEARMONTH"      "FAT_DAY"           
## [61] "FAT_TIME"           "FATALITY_ID"        "FATALITY_TYPE"     
## [64] "FATALITY_DATE"      "FATALITY_AGE"       "FATALITY_SEX"      
## [67] "FATALITY_LOCATION"  "EVENT_YEARMONTH"
data <- joined_data_clean

Then new variables are made through cleaning, and added together to get total_injuries, total_deaths, health_impact, and economic_damage. These are used for our analysis.

data <- data %>%
  mutate(
    INJURIES_DIRECT  = as.numeric(INJURIES_DIRECT  ), INJURIES_DIRECT  = replace_na(INJURIES_DIRECT, 0),
    INJURIES_INDIRECT= as.numeric(INJURIES_INDIRECT), INJURIES_INDIRECT= replace_na(INJURIES_INDIRECT, 0),
    DEATHS_DIRECT    = as.numeric(DEATHS_DIRECT    ), DEATHS_DIRECT    = replace_na(DEATHS_DIRECT, 0),
    DEATHS_INDIRECT  = as.numeric(DEATHS_INDIRECT  ), DEATHS_INDIRECT  = replace_na(DEATHS_INDIRECT, 0),
    DAMAGE_PROPERTY  = as.numeric(DAMAGE_PROPERTY  ), DAMAGE_PROPERTY  = replace_na(DAMAGE_PROPERTY, 0),
    DAMAGE_CROPS     = as.numeric(DAMAGE_CROPS     ), DAMAGE_CROPS     = replace_na(DAMAGE_CROPS, 0)
  ) %>%

  mutate(
    total_injuries = INJURIES_DIRECT + INJURIES_INDIRECT,
    total_deaths   = DEATHS_DIRECT   + DEATHS_INDIRECT,
    health_impact  = total_injuries + total_deaths,
    economic_damage = DAMAGE_PROPERTY + DAMAGE_CROPS
  )
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `DAMAGE_PROPERTY = as.numeric(DAMAGE_PROPERTY)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

Ensuring that there is a “period of time” value present to help analysis, the data is further cleaned.

if ("MONTH_NAME" %in% names(data)) {
  data <- data %>% mutate(month = factor(MONTH_NAME, levels = month.name))
} else if ("BEGIN_DATE_TIME" %in% names(data)) {
  data <- data %>% mutate(month = factor(month(ymd_hms(BEGIN_DATE_TIME), 
                        label = TRUE, abbr = FALSE), levels = month.name))
} else {
  data <- data %>% mutate(month = NA)
}

data <- data %>%
  mutate(EVENT_TYPE = str_to_title(trimws(as.character(EVENT_TYPE))))

Results

In our first section of the analysis, we are asking the question “Across the United States, which types of events (as indicated in the EVENT_TYPE variable) are most harmful with respect to population health?” The data shows that the top three most harm full events are Flash Floods with 1356 casualties, Wildfires with 872 casualites, and Tornados with 162 causalites. A plot is then created showing the top 10 most harmful events.

health_by_event <- data %>%
  group_by(EVENT_TYPE) %>%
  summarise(
    total_deaths = sum(total_deaths,   na.rm = TRUE),
    total_injuries = sum(total_injuries, na.rm = TRUE),
    total_health = sum(health_impact, na.rm = TRUE),
    n_events = n()
  ) %>%
  arrange(desc(total_health))

top10_health <- health_by_event %>% slice_max(total_health, n = 10)
print(top10_health)
## # A tibble: 10 × 5
##    EVENT_TYPE        total_deaths total_injuries total_health n_events
##    <chr>                    <dbl>          <dbl>        <dbl>    <int>
##  1 Flash Flood               1356             12         1368     9338
##  2 Wildfire                   872            413         1285      214
##  3 Tornado                    162            730          892     2012
##  4 Dust Storm                  91            417          508      237
##  5 Excessive Heat             309            172          481      438
##  6 Heat                       173             33          206      538
##  7 Thunderstorm Wind           91            107          198    14432
##  8 Winter Storm                63             23           86     2102
##  9 Lightning                   14             57           71      120
## 10 Winter Weather              27             44           71     2775
data_plot_harmevents <- top10_health %>% slice_max(total_health, n = 10) %>% arrange(total_health)
ggplot(data_plot_harmevents, aes(x = fct_reorder(EVENT_TYPE, total_health), y = total_health)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top Event Types by Total Health Impact (Deaths + Injuries)",
       x = "Event Type", y = "Total health impact (count)") +
  scale_y_continuous(labels = comma) +
  theme_minimal()

## # A tibble: 10 × 5
##    EVENT_TYPE        total_deaths total_injuries total_health n_events
##    <chr>                    <dbl>          <dbl>        <dbl>    <int>
##  1 Lightning                   14             57           71      120
##  2 Winter Weather              27             44           71     2775
##  3 Winter Storm                63             23           86     2102
##  4 Thunderstorm Wind           91            107          198    14432
##  5 Heat                       173             33          206      538
##  6 Excessive Heat             309            172          481      438
##  7 Dust Storm                  91            417          508      237
##  8 Tornado                    162            730          892     2012
##  9 Wildfire                   872            413         1285      214
## 10 Flash Flood               1356             12         1368     9338

The second part of our analysis asks “Across the United States, which types of events are most happening in which states?” Through the analysis, we are able to see which events happen and how frequently per state. The heat map plot demonstrates the top ten event types and in which states they occur with frequency. The message script states that the original database lacks any data for economic factors, and R will display 0s.

events_by_state_type <- data %>%
  group_by(STATE, EVENT_TYPE) %>%
  summarise(n_events = n(), total_health = sum(health_impact, na.rm = TRUE), total_econ = sum(economic_damage, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
top_event_per_state <- events_by_state_type %>%
  group_by(STATE) %>%
  slice_max(n_events, n = 1, with_ties = FALSE) %>%
  arrange(STATE)

print(top_event_per_state %>% slice(1:10))
## # A tibble: 67 × 5
## # Groups:   STATE [67]
##    STATE          EVENT_TYPE               n_events total_health total_econ
##    <chr>          <chr>                       <int>        <dbl>      <dbl>
##  1 ALABAMA        Thunderstorm Wind            1091           15          0
##  2 ALASKA         Blizzard                       39            0          0
##  3 AMERICAN SAMOA Flash Flood                    41            0          0
##  4 ARIZONA        Flash Flood                    73            0          0
##  5 ARKANSAS       Thunderstorm Wind             247            7          0
##  6 ATLANTIC NORTH Marine Thunderstorm Wind      364            0          0
##  7 ATLANTIC SOUTH Marine Thunderstorm Wind      337            0          0
##  8 CALIFORNIA     Flood                         601            6          0
##  9 COLORADO       Hail                          267            0          0
## 10 CONNECTICUT    Winter Weather                 23            0          0
## # ℹ 57 more rows
message("NOAA currently has no data for total_econ in 2025, which is why it is 0")
## NOAA currently has no data for total_econ in 2025, which is why it is 0
top10_event_types <- events_by_state_type %>%
  group_by(EVENT_TYPE) %>% summarise(total = sum(n_events)) %>%
  arrange(desc(total)) %>% slice(1:10) %>% pull(EVENT_TYPE)

heatmap_top10 <- events_by_state_type %>%
  filter(EVENT_TYPE %in% top10_event_types)

state_order <- heatmap_top10 %>% group_by(STATE) %>% summarise(s = sum(n_events)) %>% arrange(desc(s)) %>% pull(STATE)
heatmap_data <- heatmap_top10 %>% mutate(STATE = factor(STATE, levels = state_order))

ggplot(heatmap_top10, aes(x = STATE, y = fct_rev(factor(EVENT_TYPE)), fill = n_events)) +
  geom_tile(color = "white") +
  labs(title = "Counts of Top 10 Event Types by State",
       x = "State", y = "Event Type", fill = "Event count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

## # A tibble: 413 × 5
##    STATE   EVENT_TYPE              n_events total_health total_econ
##    <chr>   <chr>                      <int>        <dbl>      <dbl>
##  1 ALABAMA Drought                       23            0          0
##  2 ALABAMA Extreme Cold/Wind Chill        6            0          0
##  3 ALABAMA Flash Flood                  124            4          0
##  4 ALABAMA Flood                         13            0          0
##  5 ALABAMA Hail                         175            0          0
##  6 ALABAMA Thunderstorm Wind           1091           15          0
##  7 ALABAMA Tornado                      108           28          0
##  8 ALABAMA Winter Storm                  19            1          0
##  9 ALABAMA Winter Weather                16            0          0
## 10 ALASKA  Extreme Cold/Wind Chill        3            0          0
## # ℹ 403 more rows

The third section of the analysis tests “Which types of events are characterized by which months?” The script creates a graph that shows the raising an lowering of certain event types based on time of year. For example, some event types such as thunderstorms become more frequent as the year aproaches summer. The plot below visualizes this.

top_types_by_count <- data %>% count(EVENT_TYPE, sort = TRUE) %>% slice_max(n, n = 5) %>% pull(EVENT_TYPE)

month_plot_data <- data %>%
  filter(EVENT_TYPE %in% top_types_by_count, !is.na(month)) %>%
  group_by(month, EVENT_TYPE) %>%
  summarise(n_events = n(), avg_health = mean(health_impact, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
month_plot_data$month <- factor(month_plot_data$month, levels = month.name)

ggplot(month_plot_data, aes(x = month, y = n_events, color = EVENT_TYPE, group = EVENT_TYPE)) +
  geom_line(size = 1) + geom_point() +
  labs(title = "Monthly Counts for Top Event Types",
       x = "Month", y = "Event Count") +
  theme_minimal() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## # A tibble: 30 × 4
##    month    EVENT_TYPE        n_events avg_health
##    <fct>    <chr>                <int>      <dbl>
##  1 January  Flash Flood            253    0      
##  2 January  Flood                  569    0      
##  3 January  Hail                     4    0      
##  4 January  High Wind              282    0.00709
##  5 January  Thunderstorm Wind       56    0      
##  6 February Flash Flood            770    0.0234 
##  7 February Flood                 1381    0.00724
##  8 February Hail                    43    0      
##  9 February High Wind              575    0.00696
## 10 February Thunderstorm Wind      583    0.00343
## # ℹ 20 more rows

In the final section of our analysis, we asks “Which source is most used to report events?” Based on the data, the “Public” variable, which essentially means word of mouth or shared to the public, is the main source of event reporting with 10,495 documented reports. The bar chart then demonstrates the five most used sources for event notice.

source_counts <- data %>%
  group_by(SOURCE) %>%
  summarise(
    n_events = n()
  ) %>%
  arrange(desc(n_events))

top5_sources <- source_counts %>% slice_max(n_events, n = 5)
print(top5_sources)
## # A tibble: 5 × 2
##   SOURCE            n_events
##   <chr>                <int>
## 1 Public               10495
## 2 Emergency Manager     6846
## 3 Mesonet               5261
## 4 911 Call Center       4913
## 5 Trained Spotter       3725
ggplot(top5_sources, aes(x = fct_reorder(SOURCE, n_events), y = n_events)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(
    title = "Top 5 Reporting Sources for NOAA Storm Events",
    x = "Reporting Source",
    y = "Number of Events"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

## # A tibble: 5 × 2
##   SOURCE            n_events
##   <chr>                <int>
## 1 Public               10495
## 2 Emergency Manager     6846
## 3 Mesonet               5261
## 4 911 Call Center       4913
## 5 Trained Spotter       3725