NOAA storm data for 2025 were compiled to answer four questions. First, the storm types most harmful to human health were determined. Next, the most frequent storm types across the United States were identified. Months were then characterized by storm type. Lastly, the state that suffered the greatest harm to human health from storms was determined.
First, code from the Final Project instructions document provided by Dr. Sana Spektor was run to link the three NOAA files.
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
# Define the folder path
folder_path <- "C:/Users/Andrew/Desktop/Final Project"
# Define the file paths for the unzipped CSV files
details_file <- file.path(folder_path, "StormEvents_details-ftp_v1.0_d2025_c20260323.csv")
fatalities_file <- file.path(folder_path, "StormEvents_fatalities-ftp_v1.0_d2025_c20260323.csv")
locations_file <- file.path(folder_path, "StormEvents_locations-ftp_v1.0_d2025_c20260323.csv")
# Load the CSV files into R
details <- read_csv(details_file)
## Rows: 72241 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (24): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## lgl (1): CATEGORY
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fatalities <- read_csv(fatalities_file)
## Rows: 895 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): FATALITY_TYPE, FATALITY_DATE, FATALITY_SEX, FATALITY_LOCATION
## dbl (6): FAT_YEARMONTH, FAT_DAY, FAT_TIME, FATALITY_ID, EVENT_ID, FATALITY_AGE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
locations <- read_csv(locations_file)
## Rows: 51870 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): AZIMUTH, LOCATION
## dbl (9): YEARMONTH, EPISODE_ID, EVENT_ID, LOCATION_INDEX, RANGE, LATITUDE, L...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Join the datasets by EVENT_ID
joined_data <- details %>%
  left_join(locations, by = "EVENT_ID") %>%
  left_join(fatalities, by = "EVENT_ID")
## Warning in left_join(., fatalities, by = "EVENT_ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1142 of `x` matches multiple rows in `y`.
## ℹ Row 729 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# Save the joined data to a new CSV file
output_file <- file.path(folder_path, "StormEvents_joined_data.csv")
write_csv(joined_data, output_file)
# Inform the user
message("Joined data saved to: ", output_file)
## Joined data saved to: C:/Users/Andrew/Desktop/Final Project/StormEvents_joined_data.csv
# Optional: View the first few rows of the joined data
print(head(joined_data))
## # A tibble: 6 × 70
## BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 202503 31 1104 202503 31 1106
## 2 202503 30 1552 202503 30 1555
## 3 202501 5 1800 202501 6 2227
## 4 202501 3 1300 202501 3 1900
## 5 202501 3 1300 202501 3 1900
## 6 202501 3 1300 202501 3 1900
## # ℹ 64 more variables: EPISODE_ID.x <dbl>, EVENT_ID <dbl>, STATE <chr>,
## # STATE_FIPS <dbl>, YEAR <dbl>, MONTH_NAME <chr>, EVENT_TYPE <chr>,
## # CZ_TYPE <chr>, CZ_FIPS <dbl>, CZ_NAME <chr>, WFO <chr>,
## # BEGIN_DATE_TIME <chr>, CZ_TIMEZONE <chr>, END_DATE_TIME <chr>,
## # INJURIES_DIRECT <dbl>, INJURIES_INDIRECT <dbl>, DEATHS_DIRECT <dbl>,
## # DEATHS_INDIRECT <dbl>, DAMAGE_PROPERTY <chr>, DAMAGE_CROPS <chr>,
## # SOURCE <chr>, MAGNITUDE <dbl>, MAGNITUDE_TYPE <chr>, FLOOD_CAUSE <chr>, …
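The many-to-many warning above suggests that joined_data holds more than one row per event whenever an event has several location or fatality records. A minimal sketch with made-up toy tables (the values and the toy_ names are illustrative assumptions, not NOAA data) shows the effect and one way to check for it:

```r
library(dplyr)

# Toy tables (hypothetical values): event 1 has two locations and two fatalities
toy_details    <- tibble(EVENT_ID = c(1, 2), EVENT_TYPE = c("Flash Flood", "Hail"))
toy_locations  <- tibble(EVENT_ID = c(1, 1, 2), LOCATION = c("A", "B", "C"))
toy_fatalities <- tibble(EVENT_ID = c(1, 1), FATALITY_AGE = c(40, 62))

toy_joined <- toy_details %>%
  left_join(toy_locations, by = "EVENT_ID") %>%
  left_join(toy_fatalities, by = "EVENT_ID", relationship = "many-to-many")

# Compare the row count with the number of distinct events
rows_joined   <- nrow(toy_joined)
events_joined <- n_distinct(toy_joined$EVENT_ID)
```

When the row count exceeds the number of distinct events, any later count or sum over the joined table weights multi-record events more heavily.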
Next, the following code was used to find the most harmful storm type:
# Find total health impact for each storm type
storm_fatalities <- fatalities %>% group_by(EVENT_ID) %>% summarise(FATALITIES = n())
merged_data <- merge(storm_fatalities, joined_data[, c("EVENT_ID", "EVENT_TYPE", "INJURIES_DIRECT", "INJURIES_INDIRECT")], by = "EVENT_ID")
merged_data$health_impact <- merged_data$FATALITIES + merged_data$INJURIES_DIRECT + merged_data$INJURIES_INDIRECT
This code chunk creates a dataset with only the statistics needed to answer question 1. Health impact is calculated for each row as the number of fatalities plus the number of direct and indirect injuries. Because merge() keeps only matching rows, events with no record in the fatalities file are excluded.
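Because the printed column list for joined_data includes DEATHS_DIRECT and DEATHS_INDIRECT, health impact could also be computed from a table with one row per event, which would keep events that caused injuries but no deaths. The sketch below is an alternative under that assumption, using invented toy values, and is not the calculation used in this report:

```r
# Toy stand-in for the details table: one row per event (hypothetical values)
toy_events <- data.frame(
  EVENT_ID          = c(1, 2, 3),
  EVENT_TYPE        = c("Flash Flood", "Tornado", "Flash Flood"),
  DEATHS_DIRECT     = c(2, 0, 1),
  DEATHS_INDIRECT   = c(0, 0, 1),
  INJURIES_DIRECT   = c(3, 5, 0),
  INJURIES_INDIRECT = c(1, 0, 0)
)

# Health impact per event: all deaths plus all injuries
toy_events$health_impact <- with(
  toy_events,
  DEATHS_DIRECT + DEATHS_INDIRECT + INJURIES_DIRECT + INJURIES_INDIRECT
)
```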
# Rank storm types by total health impact
top_10_impact <- aggregate(health_impact ~ EVENT_TYPE, data = merged_data, FUN = sum)
top_10_impact <- top_10_impact[order(-top_10_impact$health_impact), ][1:10, ]
This code chunk creates a dataset that ranks the top 10 most harmful storm types based on calculated health impact.
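The same ranking can be written entirely in dplyr by passing sum() to summarise(), so that each storm type yields exactly one row. A minimal sketch on invented toy values (toy_impact is a hypothetical name):

```r
library(dplyr)

toy_impact <- tibble(
  EVENT_TYPE    = c("Flash Flood", "Tornado", "Flash Flood", "Heat", "Tornado"),
  health_impact = c(6, 5, 2, 4, 1)
)

toy_ranked <- toy_impact %>%
  group_by(EVENT_TYPE) %>%
  summarise(health_impact = sum(health_impact), .groups = "drop") %>%
  slice_max(health_impact, n = 2)   # keep the two largest totals
```

Setting .groups = "drop" returns an ungrouped result, which also avoids the grouping message that summarise() otherwise prints.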
# Make plot of storm impact
ggplot(top_10_impact, aes(x = EVENT_TYPE, y = health_impact)) + geom_col() + labs(title = "Health Impact of Storm Types", x = "Storm Type", y = "Health Impact")
This code chunk creates a graph of storm types and their impact on health.
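By default ggplot2 places the bars in alphabetical order of storm type. If a ranked display is preferred, reorder() can sort the categories by value. The sketch below uses invented toy values and is an optional variation, not the plot produced above:

```r
library(ggplot2)

toy_top <- data.frame(
  EVENT_TYPE    = c("Flash Flood", "Heat", "Tornado"),
  health_impact = c(8, 4, 6)
)

# reorder() turns EVENT_TYPE into a factor whose levels follow descending impact
toy_top$EVENT_TYPE <- reorder(toy_top$EVENT_TYPE, -toy_top$health_impact)

p <- ggplot(toy_top, aes(x = EVENT_TYPE, y = health_impact)) +
  geom_col() +
  labs(title = "Health Impact of Storm Types", x = "Storm Type", y = "Health Impact")
```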
The next section of code finds the most frequent storm types across the U.S.
# Find frequency of event types
storm_locations <- joined_data %>% group_by(STATE, EVENT_TYPE) %>% summarise(count = n())
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
top_10_frequency <- storm_locations %>% distinct()
top_10_frequency <- top_10_frequency[order(-top_10_frequency$count), ][1:10, ]
# Print table of top 10 event frequencies
print(top_10_frequency)
## # A tibble: 10 × 3
## # Groups: STATE [7]
## STATE EVENT_TYPE count
## <chr> <chr> <int>
## 1 TEXAS Flash Flood 3337
## 2 VIRGINIA Flash Flood 2140
## 3 ALABAMA Thunderstorm Wind 1533
## 4 TEXAS Hail 1520
## 5 CALIFORNIA Flood 1384
## 6 PENNSYLVANIA Flash Flood 1379
## 7 TEXAS Thunderstorm Wind 1243
## 8 VIRGINIA Thunderstorm Wind 1115
## 9 ARIZONA Flash Flood 1076
## 10 WEST VIRGINIA Flash Flood 1055
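This code chunk counts the rows of joined_data for each combination of state and storm type, ranks the combinations by count, and prints the ten largest. Because joined_data can hold several rows per event, these counts can exceed the number of distinct events.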
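For a nationwide view, storm types can also be tallied without the state grouping, counting each EVENT_ID once. A minimal base R sketch on invented toy rows (toy_rows is hypothetical), offered as one way such a tally might look:

```r
# Toy joined rows (hypothetical): event 1 appears twice because it has two locations
toy_rows <- data.frame(
  EVENT_ID   = c(1, 1, 2, 3, 4),
  STATE      = c("TEXAS", "TEXAS", "TEXAS", "VIRGINIA", "ALABAMA"),
  EVENT_TYPE = c("Flash Flood", "Flash Flood", "Hail", "Flash Flood", "Flash Flood")
)

# Keep one row per event, then count events per storm type
toy_unique  <- toy_rows[!duplicated(toy_rows$EVENT_ID), ]
type_counts <- sort(table(toy_unique$EVENT_TYPE), decreasing = TRUE)
```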
The following code characterizes months by storm types:
# Find relationship between months and events
months_and_events <- merge(joined_data[, c("EVENT_ID", "MONTH_NAME")], joined_data[, c("EVENT_ID", "EVENT_TYPE")])
months_and_events <- months_and_events %>% mutate(MONTH_NAME = factor(MONTH_NAME, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")))
months_and_events <- months_and_events %>% group_by(MONTH_NAME, EVENT_TYPE) %>% summarise(count = n()) %>% arrange(desc(count))
## `summarise()` has grouped output by 'MONTH_NAME'. You can override using the
## `.groups` argument.
Only the columns needed for this task are kept. “MONTH_NAME” is then converted to a factor so that the graph runs from January to December instead of ordering the months alphabetically. Finally, rows are counted for each combination of month and storm type.
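Base R ships the constant month.name, which already lists the months in calendar order, so the factor levels need not be typed out by hand. A short sketch on a toy vector:

```r
toy_months <- c("March", "January", "July", "January")

# month.name is a built-in character vector: "January", ..., "December"
toy_factor <- factor(toy_months, levels = month.name)

# Sorting now follows the calendar rather than the alphabet
sorted_months <- as.character(sort(toy_factor))
```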
ggplot(months_and_events, aes(x = MONTH_NAME, y = count, fill = EVENT_TYPE)) + geom_col() + labs(title = "Storm Type Frequency by Month", x = "Month", y = "Frequency")
This code chunk produces a graph of storm type frequency by month.
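To read off which storm type dominates each month without relying on the chart alone, the most frequent type per month can be extracted from a month-by-type count table. A minimal sketch with invented counts (toy_counts is hypothetical):

```r
library(dplyr)

toy_counts <- tibble(
  MONTH_NAME = c("June", "June", "July", "July"),
  EVENT_TYPE = c("Hail", "Flash Flood", "Flash Flood", "Thunderstorm Wind"),
  count      = c(30, 20, 50, 10)
)

# For each month keep the storm type with the largest count
toy_dominant <- toy_counts %>%
  group_by(MONTH_NAME) %>%
  slice_max(count, n = 1) %>%
  ungroup()
```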
The following code lists which states have the highest health impact:
# Find which state had the highest health impact
merged_data <- merge(merged_data, joined_data[, c("EVENT_ID", "STATE")])
top_10_state <- aggregate(health_impact ~ STATE, data = merged_data, FUN = sum)
top_10_state <- top_10_state[order(-top_10_state$health_impact), ][1:10, ]
print(top_10_state)
## STATE health_impact
## 46 TEXAS 11773346
## 50 WEST VIRGINIA 46685
## 7 CALIFORNIA 20409
## 19 KENTUCKY 5014
## 30 NEVADA 4713
## 18 KANSAS 3425
## 27 MISSOURI 2866
## 4 ARIZONA 2544
## 33 NEW MEXICO 1737
## 45 TENNESSEE 631
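State data is added to the earlier merged data containing health impact information. A new dataset is then created and printed, ranking the 10 states with the highest health impact. The totals are far larger than the 895 fatality records loaded earlier because each merge repeats rows for events with several location or fatality records, so they are better read as a relative ranking than as counts of people.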
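Because merge() pairs every row in one table with every matching row in the other, merging on an EVENT_ID that is repeated on both sides multiplies rows and inflates any sum taken afterward. A toy illustration with invented values:

```r
# Event 1 appears twice on each side (hypothetical values)
toy_left  <- data.frame(EVENT_ID = c(1, 1, 2), health_impact = c(4, 4, 1))
toy_right <- data.frame(EVENT_ID = c(1, 1, 2), STATE = c("TEXAS", "TEXAS", "KANSAS"))

toy_merged <- merge(toy_left, toy_right, by = "EVENT_ID")

# Event 1 now has 2 x 2 = 4 rows, so its impact of 4 is summed four times
toy_totals <- aggregate(health_impact ~ STATE, data = toy_merged, FUN = sum)

# Deduplicating both sides first gives one row per event
toy_clean <- merge(unique(toy_left), unique(toy_right), by = "EVENT_ID")
```

Comparing nrow() before and after a merge is a quick check for this kind of duplication.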
ggplot(top_10_impact, aes(x = EVENT_TYPE, y = health_impact)) + geom_col() + labs(title = "Health Impact of Storm Types", x = "Storm Type", y = "Health Impact")
Health impact of different storm types.
According to the graph, flash floods had a far greater health impact than any other storm type.
print(top_10_frequency)
## # A tibble: 10 × 3
## # Groups: STATE [7]
## STATE EVENT_TYPE count
## <chr> <chr> <int>
## 1 TEXAS Flash Flood 3337
## 2 VIRGINIA Flash Flood 2140
## 3 ALABAMA Thunderstorm Wind 1533
## 4 TEXAS Hail 1520
## 5 CALIFORNIA Flood 1384
## 6 PENNSYLVANIA Flash Flood 1379
## 7 TEXAS Thunderstorm Wind 1243
## 8 VIRGINIA Thunderstorm Wind 1115
## 9 ARIZONA Flash Flood 1076
## 10 WEST VIRGINIA Flash Flood 1055
According to the table, flash floods in Texas were the most frequent combination of state and storm type, and flash floods account for five of the ten rows.
ggplot(months_and_events, aes(x = MONTH_NAME, y = count, fill = EVENT_TYPE)) + geom_col() + labs(title = "Storm Type Frequency by Month", x = "Month", y = "Frequency", fill = "Storm Type")
Storm types characterized by months.
According to the graph, July is characterized by flash floods.
print(top_10_state)
## STATE health_impact
## 46 TEXAS 11773346
## 50 WEST VIRGINIA 46685
## 7 CALIFORNIA 20409
## 19 KENTUCKY 5014
## 30 NEVADA 4713
## 18 KANSAS 3425
## 27 MISSOURI 2866
## 4 ARIZONA 2544
## 33 NEW MEXICO 1737
## 45 TENNESSEE 631
According to the table, Texas had the highest storm-related health impact of any state.