Final Project

Synopsis

NOAA data on storms throughout 2025 were compiled to answer four questions. First, the storm types most harmful to human health were identified. Next, the most frequent storm types across the United States were identified. Months were then characterized by storm type. Finally, the state in which storms caused the most harm to human health was identified.

Data Processing

First, code from the Final Project instructions document provided by Dr. Sana Spektor was run to link the three NOAA files.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.3

# Define the folder path
folder_path <- "C:/Users/Andrew/Desktop/Final Project"

# Define the file paths for the unzipped CSV files
details_file <- file.path(folder_path, "StormEvents_details-ftp_v1.0_d2025_c20260323.csv")
fatalities_file <- file.path(folder_path, "StormEvents_fatalities-ftp_v1.0_d2025_c20260323.csv")
locations_file <- file.path(folder_path, "StormEvents_locations-ftp_v1.0_d2025_c20260323.csv")

# Load the CSV files into R
details <- read_csv(details_file)

## Rows: 72241 Columns: 51

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (24): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## lgl  (1): CATEGORY
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fatalities <- read_csv(fatalities_file)

## Rows: 895 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): FATALITY_TYPE, FATALITY_DATE, FATALITY_SEX, FATALITY_LOCATION
## dbl (6): FAT_YEARMONTH, FAT_DAY, FAT_TIME, FATALITY_ID, EVENT_ID, FATALITY_AGE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

locations <- read_csv(locations_file)

## Rows: 51870 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): AZIMUTH, LOCATION
## dbl (9): YEARMONTH, EPISODE_ID, EVENT_ID, LOCATION_INDEX, RANGE, LATITUDE, L...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Join the datasets by EVENT_ID
joined_data <- details %>%
  left_join(locations, by = "EVENT_ID") %>%
  left_join(fatalities, by = "EVENT_ID")

## Warning in left_join(., fatalities, by = "EVENT_ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1142 of `x` matches multiple rows in `y`.
## ℹ Row 729 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

# Save the joined data to a new CSV file
output_file <- file.path(folder_path, "StormEvents_joined_data.csv")
write_csv(joined_data, output_file)

# Inform the user
message("Joined data saved to: ", output_file)

## Joined data saved to: C:/Users/Andrew/Desktop/Final Project/StormEvents_joined_data.csv

# Optional: View the first few rows of the joined data
print(head(joined_data))

## # A tibble: 6 × 70
##   BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME
##             <dbl>     <dbl>      <dbl>         <dbl>   <dbl>    <dbl>
## 1          202503        31       1104        202503      31     1106
## 2          202503        30       1552        202503      30     1555
## 3          202501         5       1800        202501       6     2227
## 4          202501         3       1300        202501       3     1900
## 5          202501         3       1300        202501       3     1900
## 6          202501         3       1300        202501       3     1900
## # ℹ 64 more variables: EPISODE_ID.x <dbl>, EVENT_ID <dbl>, STATE <chr>,
## #   STATE_FIPS <dbl>, YEAR <dbl>, MONTH_NAME <chr>, EVENT_TYPE <chr>,
## #   CZ_TYPE <chr>, CZ_FIPS <dbl>, CZ_NAME <chr>, WFO <chr>,
## #   BEGIN_DATE_TIME <chr>, CZ_TIMEZONE <chr>, END_DATE_TIME <chr>,
## #   INJURIES_DIRECT <dbl>, INJURIES_INDIRECT <dbl>, DEATHS_DIRECT <dbl>,
## #   DEATHS_INDIRECT <dbl>, DAMAGE_PROPERTY <chr>, DAMAGE_CROPS <chr>,
## #   SOURCE <chr>, MAGNITUDE <dbl>, MAGNITUDE_TYPE <chr>, FLOOD_CAUSE <chr>, …
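
The many-to-many warning printed by the second join indicates that EVENT_ID repeats on both sides, so joined_data can hold several rows for a single event. The following is a minimal sketch of how that happens, using small made-up tables (toy_details, toy_locations, and toy_fatalities are hypothetical names and values, not the NOAA files):

```r
library(dplyr)

# Hypothetical toy tables; all values are invented for illustration only.
toy_details <- tibble(EVENT_ID = c(1, 2), EVENT_TYPE = c("Flash Flood", "Hail"))
toy_locations <- tibble(EVENT_ID = c(1, 1, 2), LOCATION_INDEX = c(1, 2, 1))
toy_fatalities <- tibble(EVENT_ID = c(1, 1), FATALITY_ID = c(10, 11))

# Event 1 has two locations and two fatalities, so it expands to 2 x 2 = 4 rows.
toy_joined <- toy_details %>%
  left_join(toy_locations, by = "EVENT_ID") %>%
  left_join(toy_fatalities, by = "EVENT_ID", relationship = "many-to-many")

# Number of joined rows that each event occupies
rows_per_event <- toy_joined %>% count(EVENT_ID, name = "n_rows")
print(rows_per_event)

# Collapse back to one row per event when events are the unit of interest
toy_events <- toy_joined %>% distinct(EVENT_ID, EVENT_TYPE)
print(toy_events)
```

Running the same count() on joined_data would show how many rows each real event occupies, which matters for any later step that counts or sums rows.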

Next, the following code was used to find the most harmful storm type:

# Find total health impact for each storm type
storm_fatalities <- fatalities %>% group_by(EVENT_ID) %>% summarise(FATALITIES = n())

merged_data <- merge(storm_fatalities, joined_data[, c("EVENT_ID", "EVENT_TYPE", "INJURIES_DIRECT", "INJURIES_INDIRECT")], by = "EVENT_ID")

merged_data$health_impact <- merged_data$FATALITIES + merged_data$INJURIES_DIRECT + merged_data$INJURIES_INDIRECT

This code chunk creates a dataset with only the variables needed to answer question 1. Health impact was calculated as the sum of the number of fatalities, the number of direct injuries, and the number of indirect injuries. Because merge() keeps only matching EVENT_ID values by default, events without a fatality record are not included.
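
Since joined_data can contain several rows per event, one possible variant computes health impact on a table with exactly one row per event. The sketch below assumes that the event is the intended unit and uses invented toy data. Unlike the merge() call above, it also keeps events that have injuries but no fatalities, which is a design choice rather than the method used in this report.

```r
library(dplyr)

# Hypothetical per-event table (one row per event, as in the details file)
toy_details <- tibble(
  EVENT_ID = c(1, 2, 3),
  EVENT_TYPE = c("Flash Flood", "Tornado", "Hail"),
  INJURIES_DIRECT = c(2, 5, 0),
  INJURIES_INDIRECT = c(1, 0, 0)
)

# Hypothetical fatality records (one row per death)
toy_fatalities <- tibble(EVENT_ID = c(1, 1, 2))

# Count fatality records per event
fatality_counts <- toy_fatalities %>% count(EVENT_ID, name = "FATALITIES")

# Join once per event, treat a missing count as zero, then add the three components
toy_impact <- toy_details %>%
  left_join(fatality_counts, by = "EVENT_ID") %>%
  mutate(
    FATALITIES = coalesce(FATALITIES, 0L),
    health_impact = FATALITIES + INJURIES_DIRECT + INJURIES_INDIRECT
  )

print(toy_impact)
```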

# Rank storm types by total health impact
top_10_impact <- aggregate(health_impact ~ EVENT_TYPE, data = merged_data, FUN = sum)

top_10_impact <- top_10_impact[order(-top_10_impact$health_impact), ][1:10, ]

This code chunk sums health impact by storm type and keeps the 10 storm types with the highest totals.
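
The same ranking can also be written as a single dplyr pipeline. This is a sketch on invented toy data (toy_merged is a hypothetical stand-in for merged_data), keeping the top 2 rather than the top 10 so the result is easy to check by hand:

```r
library(dplyr)

# Hypothetical stand-in for merged_data
toy_merged <- tibble(
  EVENT_TYPE = c("Flash Flood", "Flash Flood", "Tornado", "Hail"),
  health_impact = c(5, 3, 6, 1)
)

# Sum health impact per storm type, then keep the highest totals
toy_top <- toy_merged %>%
  group_by(EVENT_TYPE) %>%
  summarise(health_impact = sum(health_impact), .groups = "drop") %>%
  slice_max(health_impact, n = 2, with_ties = FALSE)

print(toy_top)
```

slice_max() returns the rows already sorted from highest to lowest, so no separate order() step is needed.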

# Make plot of storm impact
ggplot(top_10_impact, aes(x = EVENT_TYPE, y = health_impact)) + geom_col() + labs(title = "Health Impact of Storm Types", x = "Storm Type", y = "Health Impact")

This code chunk creates a bar chart of the total health impact of each of the top 10 storm types.
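
By default, ggplot2 orders a character x-axis alphabetically, so the bars do not appear in rank order. One optional way to sort them, shown here on invented toy data (toy_top is hypothetical), is to wrap the storm type in reorder() and flip the axes so long labels stay readable:

```r
library(ggplot2)

# Hypothetical stand-in for top_10_impact
toy_top <- data.frame(
  EVENT_TYPE = c("Hail", "Flash Flood", "Tornado"),
  health_impact = c(1, 8, 6)
)

# reorder() sorts the storm types by health impact; coord_flip() draws horizontal bars
p <- ggplot(toy_top, aes(x = reorder(EVENT_TYPE, health_impact), y = health_impact)) +
  geom_col() +
  coord_flip() +
  labs(title = "Health Impact of Storm Types", x = "Storm Type", y = "Health Impact")

print(p)
```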

The next section of code finds the most frequent storm types across the U.S.

# Find frequency of event types
storm_locations <- joined_data %>% group_by(STATE, EVENT_TYPE) %>% summarise(count = n())

## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.

top_10_frequency <- storm_locations %>% distinct()

top_10_frequency <- top_10_frequency[order(-top_10_frequency$count), ][1:10, ]

# Print table of top 10 event frequencies
print(top_10_frequency)

## # A tibble: 10 × 3
## # Groups:   STATE [7]
##    STATE         EVENT_TYPE        count
##    <chr>         <chr>             <int>
##  1 TEXAS         Flash Flood        3337
##  2 VIRGINIA      Flash Flood        2140
##  3 ALABAMA       Thunderstorm Wind  1533
##  4 TEXAS         Hail               1520
##  5 CALIFORNIA    Flood              1384
##  6 PENNSYLVANIA  Flash Flood        1379
##  7 TEXAS         Thunderstorm Wind  1243
##  8 VIRGINIA      Thunderstorm Wind  1115
##  9 ARIZONA       Flash Flood        1076
## 10 WEST VIRGINIA Flash Flood        1055

This code chunk counts the rows of the joined data for each combination of state and storm type, ranks the combinations by count, and prints the 10 most frequent.
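
Because the question concerns storm types across the whole country, a national ranking may also be useful. The sketch below is one possible approach on invented toy data (toy_joined is hypothetical): it counts each EVENT_ID once per storm type, so that events occupying several joined rows are not counted repeatedly.

```r
library(dplyr)

# Hypothetical stand-in for joined_data; event 1 appears twice
toy_joined <- tibble(
  EVENT_ID = c(1, 1, 2, 3, 4, 5),
  STATE = c("TEXAS", "TEXAS", "TEXAS", "VIRGINIA", "ALABAMA", "ARIZONA"),
  EVENT_TYPE = c("Flash Flood", "Flash Flood", "Hail", "Flash Flood", "Hail", "Flash Flood")
)

# Keep one row per event, then count events per storm type nationwide
toy_frequency <- toy_joined %>%
  distinct(EVENT_ID, EVENT_TYPE) %>%
  count(EVENT_TYPE, sort = TRUE, name = "count")

print(toy_frequency)
```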

The following code characterizes months by storm types:

# Find relationship between months and events
months_and_events <- merge(joined_data[, c("EVENT_ID", "MONTH_NAME")], joined_data[, c("EVENT_ID", "EVENT_TYPE")])

months_and_events <- months_and_events %>% mutate(MONTH_NAME = factor(MONTH_NAME, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")))

months_and_events <- months_and_events %>% group_by(MONTH_NAME, EVENT_TYPE) %>% summarise(count = n()) %>% arrange(desc(count))

## `summarise()` has grouped output by 'MONTH_NAME'. You can override using the
## `.groups` argument.

Only the columns needed for the task are stored in a dataset. “MONTH_NAME” is then converted into a factor so that the graph begins with January instead of ordering the months alphabetically.
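
As a shorter alternative to typing out the twelve month names, base R provides the built-in constant month.name. A minimal sketch on an invented vector (toy_months is hypothetical):

```r
# month.name is a built-in R constant: "January", "February", ..., "December"
toy_months <- c("March", "January", "December", "January")

# Use calendar order for the factor levels instead of alphabetical order
toy_factor <- factor(toy_months, levels = month.name)

# Sorting now follows the calendar
print(sort(toy_factor))
```

This assumes the month names in the data are spelled out in full English, as they are in the level list used in this report.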

ggplot(months_and_events, aes(x = MONTH_NAME, y = count, fill = EVENT_TYPE)) + geom_col() + labs(title = "Storm Type Frequency by Month", x = "Month", y = "Frequency")

This code chunk produces a graph of storm type frequency by month.
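
A stacked bar chart with many storm types can be hard to read, so a small table of the most frequent storm type in each month can complement it. The sketch below uses invented toy counts (toy_counts is a hypothetical stand-in for months_and_events):

```r
library(dplyr)

# Hypothetical stand-in for months_and_events
toy_counts <- tibble(
  MONTH_NAME = factor(c("January", "January", "July", "July"), levels = month.name),
  EVENT_TYPE = c("Winter Storm", "Flash Flood", "Flash Flood", "Hail"),
  count = c(40, 5, 90, 30)
)

# Keep the single most frequent storm type within each month
top_type_by_month <- toy_counts %>%
  group_by(MONTH_NAME) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_type_by_month)
```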

The following code lists which states have the highest health impact:

# Find which state had the highest health impact
merged_data <- merge(merged_data, joined_data[, c("EVENT_ID", "STATE")])

top_10_state <- aggregate(health_impact ~ STATE, data = merged_data, FUN = sum)

top_10_state <- top_10_state[order(-top_10_state$health_impact), ][1:10, ]

print(top_10_state)

##            STATE health_impact
## 46         TEXAS      11773346
## 50 WEST VIRGINIA         46685
## 7     CALIFORNIA         20409
## 19      KENTUCKY          5014
## 30        NEVADA          4713
## 18        KANSAS          3425
## 27      MISSOURI          2866
## 4        ARIZONA          2544
## 33    NEW MEXICO          1737
## 45     TENNESSEE           631

State data is added to the earlier merged data containing health impact information. A new dataset is then created and printed, ranking the top 10 states with the highest health impact.
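
When the table supplying STATE holds several rows per event, each merge repeats that event's health impact, which inflates the state totals. The sketch below illustrates the effect on invented toy data (toy_impact and toy_joined are hypothetical) and shows one way to avoid it by merging against one row per event:

```r
# Hypothetical per-event impact (one row per event)
toy_impact <- data.frame(EVENT_ID = c(1, 2), health_impact = c(5, 6))

# Hypothetical joined table in which event 1 appears three times
toy_joined <- data.frame(
  EVENT_ID = c(1, 1, 1, 2),
  STATE = c("TEXAS", "TEXAS", "TEXAS", "KANSAS")
)

# Merging against the repeated rows counts event 1 three times
inflated <- merge(toy_impact, toy_joined, by = "EVENT_ID")
inflated_totals <- aggregate(health_impact ~ STATE, data = inflated, FUN = sum)
print(inflated_totals)

# Merging against one row per event counts each event once
state_lookup <- unique(toy_joined[, c("EVENT_ID", "STATE")])
deduped <- merge(toy_impact, state_lookup, by = "EVENT_ID")
deduped_totals <- aggregate(health_impact ~ STATE, data = deduped, FUN = sum)
print(deduped_totals)
```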

Results

What are the most harmful storm types?

ggplot(top_10_impact, aes(x = EVENT_TYPE, y = health_impact)) + geom_col() + labs(title = "Health Impact of Storm Types", x = "Storm Type", y = "Health Impact")

Health impact of different storm types.

According to the graph, flash floods had a considerably higher total health impact than any other storm type.

What are the most frequent storm types across the U.S.?

print(top_10_frequency)

## # A tibble: 10 × 3
## # Groups:   STATE [7]
##    STATE         EVENT_TYPE        count
##    <chr>         <chr>             <int>
##  1 TEXAS         Flash Flood        3337
##  2 VIRGINIA      Flash Flood        2140
##  3 ALABAMA       Thunderstorm Wind  1533
##  4 TEXAS         Hail               1520
##  5 CALIFORNIA    Flood              1384
##  6 PENNSYLVANIA  Flash Flood        1379
##  7 TEXAS         Thunderstorm Wind  1243
##  8 VIRGINIA      Thunderstorm Wind  1115
##  9 ARIZONA       Flash Flood        1076
## 10 WEST VIRGINIA Flash Flood        1055

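According to the table, flash floods in Texas were the most frequent combination of state and storm type, with 3,337 rows in the joined data. Flash floods account for five of the ten entries, followed by thunderstorm wind with three.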

Which months characterize storm types?

ggplot(months_and_events, aes(x = MONTH_NAME, y = count, fill = EVENT_TYPE)) + geom_col() + labs(title = "Storm Type Frequency by Month", x = "Month", y = "Frequency", fill = "Storm Type")

Storm types characterized by months.

According to the graph, July is characterized by flash floods.

In which state are storms most harmful to human health?

print(top_10_state)

##            STATE health_impact
## 46         TEXAS      11773346
## 50 WEST VIRGINIA         46685
## 7     CALIFORNIA         20409
## 19      KENTUCKY          5014
## 30        NEVADA          4713
## 18        KANSAS          3425
## 27      MISSOURI          2866
## 4        ARIZONA          2544
## 33    NEW MEXICO          1737
## 45     TENNESSEE           631

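According to the table, storms were most harmful to human health in Texas, whose total health impact (11,773,346) far exceeds that of second-ranked West Virginia (46,685). Because an event can appear in several rows of the merged data, these totals are sums over rows and overstate the number of people affected.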

Andrew Wang

2026-05-02