This document summarizes the cleaning and analysis of the 2025 NOAA Storm Events database, which records severe weather events across the United States from the beginning of 2025 to the present. The analysis draws on three datasets (details, fatalities, and locations) and joins them into one. The merged dataset is cleaned and then explored for trends that answer four questions: which event types are most harmful, how events are distributed across states, how event types vary with the time of year, and which sources are most used to report severe weather events.
To begin, all necessary packages are loaded.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.5.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.5.2
library(forcats)
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(stringr)
## Warning: package 'stringr' was built under R version 4.5.2
The working directory and file paths are then set, and the raw details, fatalities, and locations data files are read in and assigned the names "details", "fatalities", and "locations".
folder_path <- "C:/Users/Kathr/Desktop/DAT511/New folder (3)"
setwd("C:/Users/Kathr/Desktop/DAT511/New folder (3)")
getwd()
## [1] "C:/Users/Kathr/Desktop/DAT511/New folder (3)"
details_file <- file.path(folder_path, "StormEvents_details-ftp_v1.0_d2025_c20251118.csv.gz")
fatalities_file <- file.path(folder_path, "StormEvents_fatalities-ftp_v1.0_d2025_c20251118.csv.gz")
locations_file <- file.path(folder_path, "StormEvents_locations-ftp_v1.0_d2025_c20251118 (1).csv.gz")
details <- read_csv(details_file)
## Rows: 44721 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (24): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## lgl (1): CATEGORY
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fatalities <- read_csv(fatalities_file)
## Rows: 423 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): FATALITY_TYPE, FATALITY_DATE, FATALITY_SEX, FATALITY_LOCATION
## dbl (7): FAT_YEARMONTH, FAT_DAY, FAT_TIME, FATALITY_ID, EVENT_ID, FATALITY_A...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
locations <- read_csv(locations_file)
## Rows: 31130 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): AZIMUTH, LOCATION
## dbl (9): YEARMONTH, EPISODE_ID, EVENT_ID, LOCATION_INDEX, RANGE, LATITUDE, L...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
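As the read-in messages themselves suggest, the column-specification output can be silenced on later runs by passing show_col_types = FALSE; a minimal optional sketch using the same files as above:
details <- read_csv(details_file, show_col_types = FALSE)
fatalities <- read_csv(fatalities_file, show_col_types = FALSE)
locations <- read_csv(locations_file, show_col_types = FALSE)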
The datasets are then merged into a single usable dataset, "joined_data", using the shared key EVENT_ID. A message is printed confirming that joined_data was saved, and a preview of the first rows is shown.
joined_data <- details %>%
left_join(locations, by = "EVENT_ID") %>%
left_join(fatalities, by = "EVENT_ID")
## Warning in left_join(., fatalities, by = "EVENT_ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 896 of `x` matches multiple rows in `y`.
## ℹ Row 325 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
output_file <- file.path(folder_path, "StormEvents_joined_data.csv")
write_csv(joined_data, output_file)
message("Joined data saved to: ", output_file)
## Joined data saved to: C:/Users/Kathr/Desktop/DAT511/New folder (3)/StormEvents_joined_data.csv
print(head(joined_data))
## # A tibble: 6 × 71
## BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 202503 31 1104 202503 31 1106
## 2 202503 30 1552 202503 30 1555
## 3 202501 5 1800 202501 6 2227
## 4 202501 3 1300 202501 3 1900
## 5 202501 3 1300 202501 3 1900
## 6 202501 3 1300 202501 3 1900
## # ℹ 65 more variables: EPISODE_ID.x <dbl>, EVENT_ID <dbl>, STATE <chr>,
## # STATE_FIPS <dbl>, YEAR <dbl>, MONTH_NAME <chr>, EVENT_TYPE <chr>,
## # CZ_TYPE <chr>, CZ_FIPS <dbl>, CZ_NAME <chr>, WFO <chr>,
## # BEGIN_DATE_TIME <chr>, CZ_TIMEZONE <chr>, END_DATE_TIME <chr>,
## # INJURIES_DIRECT <dbl>, INJURIES_INDIRECT <dbl>, DEATHS_DIRECT <dbl>,
## # DEATHS_INDIRECT <dbl>, DAMAGE_PROPERTY <chr>, DAMAGE_CROPS <chr>,
## # SOURCE <chr>, MAGNITUDE <dbl>, MAGNITUDE_TYPE <chr>, FLOOD_CAUSE <chr>, …
joined_data is then cleaned (redundant columns from the join are removed) and renamed "data".
joined_data_clean <- joined_data %>%
select(-EPISODE_ID.y, -YEARMONTH, -LOCATION_INDEX)
names(joined_data_clean)
## [1] "BEGIN_YEARMONTH" "BEGIN_DAY" "BEGIN_TIME"
## [4] "END_YEARMONTH" "END_DAY" "END_TIME"
## [7] "EPISODE_ID.x" "EVENT_ID" "STATE"
## [10] "STATE_FIPS" "YEAR" "MONTH_NAME"
## [13] "EVENT_TYPE" "CZ_TYPE" "CZ_FIPS"
## [16] "CZ_NAME" "WFO" "BEGIN_DATE_TIME"
## [19] "CZ_TIMEZONE" "END_DATE_TIME" "INJURIES_DIRECT"
## [22] "INJURIES_INDIRECT" "DEATHS_DIRECT" "DEATHS_INDIRECT"
## [25] "DAMAGE_PROPERTY" "DAMAGE_CROPS" "SOURCE"
## [28] "MAGNITUDE" "MAGNITUDE_TYPE" "FLOOD_CAUSE"
## [31] "CATEGORY" "TOR_F_SCALE" "TOR_LENGTH"
## [34] "TOR_WIDTH" "TOR_OTHER_WFO" "TOR_OTHER_CZ_STATE"
## [37] "TOR_OTHER_CZ_FIPS" "TOR_OTHER_CZ_NAME" "BEGIN_RANGE"
## [40] "BEGIN_AZIMUTH" "BEGIN_LOCATION" "END_RANGE"
## [43] "END_AZIMUTH" "END_LOCATION" "BEGIN_LAT"
## [46] "BEGIN_LON" "END_LAT" "END_LON"
## [49] "EPISODE_NARRATIVE" "EVENT_NARRATIVE" "DATA_SOURCE"
## [52] "RANGE" "AZIMUTH" "LOCATION"
## [55] "LATITUDE" "LONGITUDE" "LAT2"
## [58] "LON2" "FAT_YEARMONTH" "FAT_DAY"
## [61] "FAT_TIME" "FATALITY_ID" "FATALITY_TYPE"
## [64] "FATALITY_DATE" "FATALITY_AGE" "FATALITY_SEX"
## [67] "FATALITY_LOCATION" "EVENT_YEARMONTH"
data <- joined_data_clean
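One further optional tidy-up (not applied here) would be to drop the join suffix left on the episode column, since only the copy from the details file was kept:
data <- data %>% rename(EPISODE_ID = EPISODE_ID.x)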
New variables are then created as part of cleaning: the injury, death, and damage columns are converted to numeric, missing values are replaced with 0, and the columns are combined into total_injuries, total_deaths, health_impact, and economic_damage. These are used throughout the analysis.
data <- data %>%
mutate(
INJURIES_DIRECT = as.numeric(INJURIES_DIRECT ), INJURIES_DIRECT = replace_na(INJURIES_DIRECT, 0),
INJURIES_INDIRECT= as.numeric(INJURIES_INDIRECT), INJURIES_INDIRECT= replace_na(INJURIES_INDIRECT, 0),
DEATHS_DIRECT = as.numeric(DEATHS_DIRECT ), DEATHS_DIRECT = replace_na(DEATHS_DIRECT, 0),
DEATHS_INDIRECT = as.numeric(DEATHS_INDIRECT ), DEATHS_INDIRECT = replace_na(DEATHS_INDIRECT, 0),
DAMAGE_PROPERTY = as.numeric(DAMAGE_PROPERTY ), DAMAGE_PROPERTY = replace_na(DAMAGE_PROPERTY, 0),
DAMAGE_CROPS = as.numeric(DAMAGE_CROPS ), DAMAGE_CROPS = replace_na(DAMAGE_CROPS, 0)
) %>%
mutate(
total_injuries = INJURIES_DIRECT + INJURIES_INDIRECT,
total_deaths = DEATHS_DIRECT + DEATHS_INDIRECT,
health_impact = total_injuries + total_deaths,
economic_damage = DAMAGE_PROPERTY + DAMAGE_CROPS
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `DAMAGE_PROPERTY = as.numeric(DAMAGE_PROPERTY)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
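The "NAs introduced by coercion" warning indicates that the DAMAGE columns contain non-numeric text, which as.numeric() turns into NA and the next step replaces with 0. If, as in earlier NOAA Storm Events files, those values are abbreviated strings such as "10.00K", a helper along these lines (a hypothetical parse_damage(), assuming K/M/B suffixes) could recover the dollar amounts before summing:
parse_damage <- function(x) {
  x <- str_trim(as.character(x))
  # map the suffix to a multiplier: K = thousand, M = million, B = billion
  mult <- case_when(
    str_detect(x, regex("K$", ignore_case = TRUE)) ~ 1e3,
    str_detect(x, regex("M$", ignore_case = TRUE)) ~ 1e6,
    str_detect(x, regex("B$", ignore_case = TRUE)) ~ 1e9,
    TRUE ~ 1
  )
  # strip the suffix, convert the numeric part, and treat anything unparseable as 0
  num <- suppressWarnings(as.numeric(str_remove(x, regex("[KMB]$", ignore_case = TRUE))))
  coalesce(num * mult, 0)
}
# e.g. data <- data %>% mutate(DAMAGE_PROPERTY = parse_damage(DAMAGE_PROPERTY))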
To ensure a usable time-of-year value for the analysis, a month variable is derived (from MONTH_NAME when available, otherwise from BEGIN_DATE_TIME), and EVENT_TYPE is standardized to title case.
if ("MONTH_NAME" %in% names(data)) {
data <- data %>% mutate(month = factor(MONTH_NAME, levels = month.name))
} else if ("BEGIN_DATE_TIME" %in% names(data)) {
data <- data %>% mutate(month = factor(month(ymd_hms(BEGIN_DATE_TIME),
label = TRUE, abbr = FALSE), levels = month.name))
} else {
data <- data %>% mutate(month = NA)
}
data <- data %>%
mutate(EVENT_TYPE = str_to_title(trimws(as.character(EVENT_TYPE))))
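A quick optional check on the derived fields, to confirm that every event received a month and that the event-type labels are consistent:
data %>% count(month)
data %>% distinct(EVENT_TYPE) %>% arrange(EVENT_TYPE) %>% head(10)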
In the first section of the analysis, we ask: "Across the United States, which types of events (as indicated in the EVENT_TYPE variable) are most harmful with respect to population health?" The data show that the three most harmful event types are Flash Flood (1,368 combined deaths and injuries), Wildfire (1,285), and Tornado (892). A plot is then created showing the top 10 most harmful event types.
health_by_event <- data %>%
group_by(EVENT_TYPE) %>%
summarise(
total_deaths = sum(total_deaths, na.rm = TRUE),
total_injuries = sum(total_injuries, na.rm = TRUE),
total_health = sum(health_impact, na.rm = TRUE),
n_events = n()
) %>%
arrange(desc(total_health))
top10_health <- health_by_event %>% slice_max(total_health, n = 10)
print(top10_health)
## # A tibble: 10 × 5
## EVENT_TYPE total_deaths total_injuries total_health n_events
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Flash Flood 1356 12 1368 9338
## 2 Wildfire 872 413 1285 214
## 3 Tornado 162 730 892 2012
## 4 Dust Storm 91 417 508 237
## 5 Excessive Heat 309 172 481 438
## 6 Heat 173 33 206 538
## 7 Thunderstorm Wind 91 107 198 14432
## 8 Winter Storm 63 23 86 2102
## 9 Lightning 14 57 71 120
## 10 Winter Weather 27 44 71 2775
data_plot_harmevents <- top10_health %>% slice_max(total_health, n = 10) %>% arrange(total_health)
ggplot(data_plot_harmevents, aes(x = fct_reorder(EVENT_TYPE, total_health), y = total_health)) +
geom_col() +
coord_flip() +
labs(title = "Top Event Types by Total Health Impact (Deaths + Injuries)",
x = "Event Type", y = "Total health impact (count)") +
scale_y_continuous(labels = comma) +
theme_minimal()
[Figure: "Top Event Types by Total Health Impact (Deaths + Injuries)", a horizontal bar chart of the ten event types listed in the table above.]
The second part of the analysis asks: "Across the United States, which types of events occur most often in each state?" The summary identifies the most frequent event type in each state, and the heat map shows how often each of the top ten event types occurs across states. A message notes that total_econ is 0 throughout, which the script attributes to NOAA having no 2025 economic-damage data.
events_by_state_type <- data %>%
group_by(STATE, EVENT_TYPE) %>%
summarise(n_events = n(), total_health = sum(health_impact, na.rm = TRUE), total_econ = sum(economic_damage, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
top_event_per_state <- events_by_state_type %>%
group_by(STATE) %>%
slice_max(n_events, n = 1, with_ties = FALSE) %>%
arrange(STATE)
print(top_event_per_state %>% slice(1:10))
## # A tibble: 67 × 5
## # Groups: STATE [67]
## STATE EVENT_TYPE n_events total_health total_econ
## <chr> <chr> <int> <dbl> <dbl>
## 1 ALABAMA Thunderstorm Wind 1091 15 0
## 2 ALASKA Blizzard 39 0 0
## 3 AMERICAN SAMOA Flash Flood 41 0 0
## 4 ARIZONA Flash Flood 73 0 0
## 5 ARKANSAS Thunderstorm Wind 247 7 0
## 6 ATLANTIC NORTH Marine Thunderstorm Wind 364 0 0
## 7 ATLANTIC SOUTH Marine Thunderstorm Wind 337 0 0
## 8 CALIFORNIA Flood 601 6 0
## 9 COLORADO Hail 267 0 0
## 10 CONNECTICUT Winter Weather 23 0 0
## # ℹ 57 more rows
message("NOAA currently has no data for total_econ in 2025, which is why it is 0")
## NOAA currently has no data for total_econ in 2025, which is why it is 0
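One optional way to check that claim against the raw file is to tabulate the property-damage field as it was read in, before the numeric coercion; if the column is genuinely empty the counts will be dominated by NA, whereas abbreviated strings such as "1.00K" would point back to the coercion step above:
details %>% count(DAMAGE_PROPERTY, sort = TRUE) %>% head(10)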
top10_event_types <- events_by_state_type %>%
group_by(EVENT_TYPE) %>% summarise(total = sum(n_events)) %>%
arrange(desc(total)) %>% slice(1:10) %>% pull(EVENT_TYPE)
heatmap_top10 <- events_by_state_type %>%
filter(EVENT_TYPE %in% top10_event_types)
state_order <- heatmap_top10 %>% group_by(STATE) %>% summarise(s = sum(n_events)) %>% arrange(desc(s)) %>% pull(STATE)
heatmap_data <- heatmap_top10 %>% mutate(STATE = factor(STATE, levels = state_order))
ggplot(heatmap_data, aes(x = STATE, y = fct_rev(factor(EVENT_TYPE)), fill = n_events)) +  # heatmap_data applies the state ordering computed above
geom_tile(color = "white") +
labs(title = "Counts of Top 10 Event Types by State",
x = "State", y = "Event Type", fill = "Event count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
[Figure: "Counts of Top 10 Event Types by State", a heat map of event counts for each of the top ten event types across all states.]
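A small optional companion summary ranks states by their overall counts across these top event types, which is the same information the column ordering of the heat map conveys:
heatmap_top10 %>% group_by(STATE) %>% summarise(total_events = sum(n_events)) %>% arrange(desc(total_events)) %>% head(5)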
The third section of the analysis examines: "Which types of events are characteristic of which months?" The script creates a graph showing how counts of certain event types rise and fall over the course of the year; for example, some event types such as thunderstorm wind become more frequent as the year approaches summer. The plot below visualizes this.
top_types_by_count <- data %>% count(EVENT_TYPE, sort = TRUE) %>% slice_max(n, n = 5) %>% pull(EVENT_TYPE)
month_plot_data <- data %>%
filter(EVENT_TYPE %in% top_types_by_count, !is.na(month)) %>%
group_by(month, EVENT_TYPE) %>%
summarise(n_events = n(), avg_health = mean(health_impact, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
month_plot_data$month <- factor(month_plot_data$month, levels = month.name)
ggplot(month_plot_data, aes(x = month, y = n_events, color = EVENT_TYPE, group = EVENT_TYPE)) +
geom_line(size = 1) + geom_point() +
labs(title = "Monthly Counts for Top Event Types",
x = "Month", y = "Event Count") +
theme_minimal() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
[Figure: "Monthly Counts for Top Event Types", a line chart of monthly event counts for the five most frequent event types.]
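To make the seasonal pattern concrete, a small optional check pulls the peak month for each of the top event types from the plotted data:
month_plot_data %>%
  group_by(EVENT_TYPE) %>%
  slice_max(n_events, n = 1, with_ties = FALSE) %>%
  select(EVENT_TYPE, peak_month = month, n_events)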
In the final section of the analysis, we ask: "Which source is most used to report events?" Based on the data, "Public" reports (information provided directly by members of the public) are the leading source, with 10,495 documented reports. The bar chart then shows the five most-used reporting sources.
source_counts <- data %>%
group_by(SOURCE) %>%
summarise(
n_events = n()
) %>%
arrange(desc(n_events))
top5_sources <- source_counts %>% slice_max(n_events, n = 5)
print(top5_sources)
## # A tibble: 5 × 2
## SOURCE n_events
## <chr> <int>
## 1 Public 10495
## 2 Emergency Manager 6846
## 3 Mesonet 5261
## 4 911 Call Center 4913
## 5 Trained Spotter 3725
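For context, these counts can also be expressed as a share of all reported events; a minimal optional extension using the scales package loaded earlier:
top5_sources %>% mutate(share = percent(n_events / sum(source_counts$n_events)))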
ggplot(top5_sources, aes(x = fct_reorder(SOURCE, n_events), y = n_events)) +
geom_col(fill = "darkorange") +
coord_flip() +
labs(
title = "Top 5 Reporting Sources for NOAA Storm Events",
x = "Reporting Source",
y = "Number of Events"
) +
theme_minimal() +
scale_y_continuous(labels = scales::comma)
[Figure: "Top 5 Reporting Sources for NOAA Storm Events", a bar chart of event counts by reporting source, matching the table above.]