Image source: https://www.noaa.gov/noaa-collections/photo-library/con00002jpg
These weather phenomena lead to billion-dollar damages and many deaths annually in the U.S. This research focuses on the Storm Events Database of NOAA, which provides information about weather incidents and is stored at the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA). This database is produced by the National Weather Service (NWS), which is a subdivision of NOAA. It is important to note that NWS is the organization that creates and stores these records, rather than NOAA or its website, and NCEI, which only maintains these files.
How the data was collected: According to the documentation from the NOAA/NWS, each entry in Storm Data is written by a Warning Coordination Meteorologist of a National Weather Service Forecast Office. He/She puts together reports of the storm obtained from multiple sources including trained spotters, law enforcement agencies, emergency management services, public reports, newspaper, and ASOS into the “StormDat” data entry system. The numbers regarding property and crop losses are reported as the best estimate figures and are not inflated. For more details see NOAA’s NWS Directive 10-1605 (https://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf).
Variables used in this project
| Variable | Type | Description |
|---|---|---|
event_type |
Categorical | Type of severe weather event (e.g. Tornado, Flood, Hail) |
state |
Categorical | U.S. state where the event occurred |
month_name |
Categorical | Month the event began |
cz_type |
Categorical | County (C), Zone (Z), or Marine (M) designation |
magnitude |
Quantitative | Wind speed (knots) or hail size (inches), depending on event |
injuries_direct |
Quantitative | Direct injuries caused by the event |
deaths_direct |
Quantitative | Direct deaths caused by the event |
damage_property |
Quantitative | Estimated property damage, cleaned to numeric dollars |
damage_crops |
Quantitative | Estimated crop damage, cleaned to numeric dollars |
begin_lat / begin_lon |
Quantitative | Coordinates where the event began |
begin_date_time |
Date/time | When the event started |
Questions explored: What are the types of extreme weather that do the most damage to property? Can the intensity of a storm help to predict the monetary damage done after taking into account the type of storm and its impact on people (injuries/deaths)?
Why this topic: This particular dataset was selected mainly due to the availability of numerous quantitative and qualitative variables, which are suitable for modeling and analysis in this project. Even though I am not personally involved in severe weather phenomena, the practical importance of this issue, tens of billion dollars of annual losses and direct impact on safety makes it an interesting area to analyze using regression and visualization.
Costs of severe weather have significantly increased in the last decades in the United States. According to data on billion-dollar weather and climate disasters provided by Climate Central, there have been fourteen separate billion-dollar disasters in the first half of 2025 already, the total cost being over $101 billion. Moreover, on average there have been about three billion-dollar disasters annually in the 1980s but there have been nineteen such disasters annually in the last decade; average annual cost has more than quadrupled, too – from $22.6 billion in the 1980s to $102 billion in the 2010s. The database, which used to be managed by NOAA’s National Centers for Environmental Information, was terminated by the U.S. federal government in 2025, and is currently maintained by non-profit Climate Central to make sure that this record is uninterrupted. It is directly related to my analysis of NOAA Storm Events Database, as it means that costs of damage in more recent years in my dataset are going to be higher and more common than in earlier decades. Climate Central. (2025). 2025 in Review: U.S. Billion-Dollar Disasters. https://www.climatecentral.org/climate-matters/2025-in-review
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(broom)
storms_raw <- read_csv("StormEvents_details-ftp_v1.0_d2024_c20260421.csv")
## Rows: 69801 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (25): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(storms_raw) <- tolower(names(storms_raw))
glimpse(storms_raw)
## Rows: 69,801
## Columns: 51
## $ begin_yearmonth <dbl> 202404, 202407, 202411, 202405, 202405, 202405, 202…
## $ begin_day <dbl> 30, 1, 16, 22, 21, 24, 1, 1, 14, 14, 17, 13, 17, 17…
## $ begin_time <dbl> 2033, 0, 230, 1230, 1200, 1405, 0, 0, 1510, 1352, 1…
## $ end_yearmonth <dbl> 202404, 202407, 202411, 202405, 202405, 202405, 202…
## $ end_day <dbl> 30, 5, 18, 22, 21, 24, 1, 1, 14, 14, 18, 13, 18, 18…
## $ end_time <dbl> 2033, 900, 1421, 1615, 1530, 1410, 1600, 1600, 1515…
## $ episode_id <dbl> 189851, 193486, 197838, 191723, 191723, 191916, 197…
## $ event_id <dbl> 1174463, 1195301, 1223377, 1184135, 1184133, 118234…
## $ state <chr> "OKLAHOMA", "LOUISIANA", "OREGON", "TEXAS", "TEXAS"…
## $ state_fips <dbl> 40, 22, 41, 48, 48, 28, 53, 41, 28, 47, 41, 53, 41,…
## $ year <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ month_name <chr> "April", "July", "November", "May", "May", "May", "…
## $ event_type <chr> "Thunderstorm Wind", "Excessive Heat", "Heavy Snow"…
## $ cz_type <chr> "C", "Z", "Z", "Z", "Z", "C", "Z", "Z", "C", "C", "…
## $ cz_fips <dbl> 141, 18, 509, 254, 254, 115, 211, 127, 141, 71, 127…
## $ cz_name <chr> "TILLMAN", "NATCHITOCHES", "EAST SLOPES OF THE OREG…
## $ wfo <chr> "OUN", "SHV", "PDT", "BRO", "BRO", "MEG", "PQR", "P…
## $ begin_date_time <chr> "30-APR-24 20:33:00", "01-JUL-24 00:00:00", "16-NOV…
## $ cz_timezone <chr> "CST-6", "CST-6", "PST-8", "CST-6", "CST-6", "CST-6…
## $ end_date_time <chr> "30-APR-24 20:33:00", "05-JUL-24 09:00:00", "18-NOV…
## $ injuries_direct <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ injuries_indirect <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deaths_direct <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deaths_indirect <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ damage_property <chr> NA, "0.00K", "0.00K", "0.00K", "0.00K", "1.00K", "0…
## $ damage_crops <chr> NA, "0.00K", "0.00K", "0.00K", "0.00K", "0.00K", "0…
## $ source <chr> "ASOS", "ASOS", "SNOTEL", "ASOS", "ASOS", "Public",…
## $ magnitude <dbl> 55.00, NA, NA, NA, NA, 52.00, NA, NA, 1.00, 0.88, N…
## $ magnitude_type <chr> "MG", NA, NA, NA, NA, "EG", NA, NA, NA, NA, NA, "MG…
## $ flood_cause <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ category <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_f_scale <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_width <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_wfo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_cz_state <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_cz_fips <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_cz_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ begin_range <dbl> 0.42, NA, NA, NA, NA, 0.00, NA, NA, 0.90, 1.49, NA,…
## $ begin_azimuth <chr> "SSW", NA, NA, NA, NA, "N", NA, NA, "NW", "SSE", NA…
## $ begin_location <chr> "FREDERICK ARPT", NA, NA, NA, NA, "ALGOMA", NA, NA,…
## $ end_range <dbl> 0.42, NA, NA, NA, NA, 0.00, NA, NA, 0.90, 1.49, NA,…
## $ end_azimuth <chr> "SSW", NA, NA, NA, NA, "N", NA, NA, "NW", "SSE", NA…
## $ end_location <chr> "FREDERICK ARPT", NA, NA, NA, NA, "ALGOMA", NA, NA,…
## $ begin_lat <dbl> 34.3444, NA, NA, NA, NA, 34.1800, NA, NA, 34.5100, …
## $ begin_lon <dbl> -98.9830, NA, NA, NA, NA, -89.0300, NA, NA, -88.210…
## $ end_lat <dbl> 34.3444, NA, NA, NA, NA, 34.1800, NA, NA, 34.5100, …
## $ end_lon <dbl> -98.9830, NA, NA, NA, NA, -89.0300, NA, NA, -88.210…
## $ episode_narrative <chr> "A rather nebulous upper air pattern existed across…
## $ event_narrative <chr> "Frederick Municipal Airport (KFDR) observation.", …
## $ data_source <chr> "CSV", "CSV", "CSV", "CSV", "CSV", "CSV", "CSV", "C…
The damage_property and damage_crops fields
are stored as text like "10.00K" or "1.50M"
rather than numbers, so I first converted them to true numeric dollar
amounts.
parse_damage <- function(x) {
x <- toupper(trimws(x))
mult <- case_when(
str_detect(x, "K$") ~ 1e3,
str_detect(x, "M$") ~ 1e6,
str_detect(x, "B$") ~ 1e9,
TRUE ~ 1
)
num <- as.numeric(str_remove(x, "[KMB]$"))
num * mult
}
parse_damage
## function (x)
## {
## x <- toupper(trimws(x))
## mult <- case_when(str_detect(x, "K$") ~ 1000, str_detect(x,
## "M$") ~ 1e+06, str_detect(x, "B$") ~ 1e+09, TRUE ~ 1)
## num <- as.numeric(str_remove(x, "[KMB]$"))
## num * mult
## }
storms_raw <- storms_raw %>%
mutate(
damage_property_num = parse_damage(damage_property),
damage_crops_num = parse_damage(damage_crops)
)
storms_raw
## # A tibble: 69,801 × 53
## begin_yearmonth begin_day begin_time end_yearmonth end_day end_time
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 202404 30 2033 202404 30 2033
## 2 202407 1 0 202407 5 900
## 3 202411 16 230 202411 18 1421
## 4 202405 22 1230 202405 22 1615
## 5 202405 21 1200 202405 21 1530
## 6 202405 24 1405 202405 24 1410
## 7 202411 1 0 202411 1 1600
## 8 202411 1 0 202411 1 1600
## 9 202405 14 1510 202405 14 1515
## 10 202405 14 1352 202405 14 1357
## # ℹ 69,791 more rows
## # ℹ 47 more variables: episode_id <dbl>, event_id <dbl>, state <chr>,
## # state_fips <dbl>, year <dbl>, month_name <chr>, event_type <chr>,
## # cz_type <chr>, cz_fips <dbl>, cz_name <chr>, wfo <chr>,
## # begin_date_time <chr>, cz_timezone <chr>, end_date_time <chr>,
## # injuries_direct <dbl>, injuries_indirect <dbl>, deaths_direct <dbl>,
## # deaths_indirect <dbl>, damage_property <chr>, damage_crops <chr>, …
storms_clean <- storms_raw %>%
select(event_type, state, magnitude, injuries_direct, deaths_direct,
damage_property_num, damage_crops_num, begin_lat, begin_lon) %>%
filter(!is.na(magnitude), magnitude > 0) %>%
mutate(total_damage = damage_property_num + damage_crops_num,
log_damage = log1p(total_damage)) %>%
arrange(desc(total_damage))
storms_clean
## # A tibble: 36,467 × 11
## event_type state magnitude injuries_direct deaths_direct damage_property_num
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Hail NEBR… 2.75 0 0 25000000
## 2 High Wind VIRG… 56 3 0 15500000
## 3 Thundersto… TEXAS 70 0 0 70000000
## 4 High Wind VIRG… 50 0 0 13800000
## 5 Thundersto… TEXAS 104 13 0 35000000
## 6 High Wind VIRG… 50 0 0 17000000
## 7 Hail NEBR… 4 0 0 25000000
## 8 Thundersto… ARIZ… 66 0 0 25000000
## 9 Hail MISS… 3 0 0 20000000
## 10 Thundersto… ARKA… 83 0 0 20000000
## # ℹ 36,457 more rows
## # ℹ 5 more variables: damage_crops_num <dbl>, begin_lat <dbl>, begin_lon <dbl>,
## # total_damage <dbl>, log_damage <dbl>
damage_by_type <- storms_clean %>%
group_by(event_type) %>%
summarize(
n_events = n(),
total_damage = sum(total_damage, na.rm = TRUE),
mean_injuries = mean(injuries_direct, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(total_damage))
damage_by_type
## # A tibble: 8 × 4
## event_type n_events total_damage mean_injuries
## <chr> <int> <dbl> <dbl>
## 1 Thunderstorm Wind 19940 364444500 0.00727
## 2 Hail 8941 341407130 0.00168
## 3 High Wind 3994 180328300 0.00501
## 4 Strong Wind 1311 4626580 0.0153
## 5 Marine Thunderstorm Wind 2192 10500 0
## 6 Marine Strong Wind 3 1000 1
## 7 Marine Hail 22 0 0
## 8 Marine High Wind 64 0 0
The most damaging events are the Thunderstorm Winds ($364 million) and Hail ($341 million) within the 2024 storms recorded in this database, totaling more than 75% of the total amount of damages of the highest ranked events. One would expect that there were many injuries caused by such events, but in reality the mean number of injuries per event is quite low (below 0.02).
We model property + crop damage (log-transformed) as a function of storm magnitude, direct injuries, and direct deaths. A log transform is used because raw damage dollars are extremely right-skewed (a small number of catastrophic events dominate).
model <- lm(log_damage ~ magnitude + injuries_direct + deaths_direct,
data = storms_clean)
summary(model)
##
## Call:
## lm(formula = log_damage ~ magnitude + injuries_direct + deaths_direct,
## data = storms_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.206 -3.353 -1.326 4.000 17.035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.276215 0.050219 25.413 < 2e-16 ***
## magnitude 0.039941 0.001075 37.163 < 2e-16 ***
## injuries_direct 0.814951 0.165355 4.928 8.34e-07 ***
## deaths_direct 1.783461 0.491317 3.630 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.992 on 25787 degrees of freedom
## (10676 observations deleted due to missingness)
## Multiple R-squared: 0.05288, Adjusted R-squared: 0.05277
## F-statistic: 480 on 3 and 25787 DF, p-value: < 2.2e-16
Model equation:
\[Y = \beta_0 + \beta_1(X_1) + \beta_2(X_2) + \epsilon\]
tidy(model) %>%
mutate(across(where(is.numeric), ~round(., 4)))
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.28 0.0502 25.4 0
## 2 magnitude 0.0399 0.0011 37.2 0
## 3 injuries_direct 0.815 0.165 4.93 0
## 4 deaths_direct 1.78 0.491 3.63 0.0003
glance(model) %>%
select(r.squared, adj.r.squared, p.value)
## # A tibble: 1 × 3
## r.squared adj.r.squared p.value
## <dbl> <dbl> <dbl>
## 1 0.0529 0.0528 1.65e-303
par(mfrow = c(2, 2))
plot(model)
Interpretation: The three predictors, all at p < 0.001 level of significance, are: magnitude (p < 2e-16), direct injuries (p = 8.34e-07), and direct deaths (p = 0.000284). A one unit increase in magnitude increases damage by 4%, but a unit increase in injuries/deaths increases the damage to a greater extent, deaths in particular have the greatest influence with coefficient of 1.78. But the adjusted R² is just 0.053. This means that the three variables mentioned above account for only 5% of the variation in the log damage. Thus, it can be concluded that even though the magnitude, injuries and deaths are important predictors, more than 95% of the variation in storm damage is due to some other factors that have not been included in this model – and those must include event types, geographic location and population of the area affected by the storms. The residual plots indicate funnel shape pattern for Residuals vs Fitted plot, indicating that despite log transformation, there is still heteroskedasticity.
damage_by_type <- storms_clean %>%
group_by(event_type) %>%
summarize(total_damage = sum(total_damage, na.rm = TRUE)) %>%
slice_max(total_damage, n = 10)
p1 <- ggplot(damage_by_type, aes(x = reorder(event_type, total_damage),
y = total_damage, fill = event_type)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
labs(title = "Top 10 Severe Weather Event Types by Total Damage (2024)",
x = "Event Type", y = "Total Damage (USD)",
caption = "Source: NOAA National Weather Service, Storm Events Database") +
theme_minimal()
ggplotly(p1)
p2 <- ggplot(storms_clean, aes(x = begin_lon, y = begin_lat,
color = event_type, size = magnitude)) +
geom_point(alpha = 0.5) +
scale_color_brewer(palette = "Set2") + # non-default palette
labs(
title = "Geographic Distribution of Severe Weather Events",
subtitle = "Point size reflects reported storm magnitude",
x = "Longitude",
y = "Latitude",
color = "Event Type",
size = "Magnitude",
caption = "Source: NOAA National Weather Service, Storm Events Database (NCEI)"
) +
theme_dark(base_size = 13) +
annotate("text", x = -100, y = 25,
label = "Gulf Coast events tend to run larger",
color = "white", fontface = "italic", size = 3.2)
p2
## Warning: Removed 5305 rows containing missing values or values outside the scale range
## (`geom_point()`).
Visualization 1 demonstrates that “Thunderstorm Wind” and “Hail” are the two most costly event types for 2024, as “Thunderstorm Wind” causes more than $364 million worth of damages alone. Visualization 2 depicts the geographical distribution of extreme weather events in the continental United States, and a cluster of large magnitude events can be seen on the Gulf Coast Region.
Another unexpected finding was how poor the explanatory strength of the regression model was (adjusted R² of just 0.053) even though each independent variable was statistically significant — showing that there is a difference between statistical significance and explanatory strength. Additionally, I thought that injuries and deaths would be more highly correlated with damage than they turned out to be in terms of the coefficients.
Given additional time, it would have been nice to include event_type as a factor variable in the regression equation because Visualization 1 makes it apparent that it probably explains much more of the variability in damage than just magnitude does. It would have also been nice to look at 2024 against a time series to find out if the trend seen in the literature still exists in this dataset.
Climate Central. (2025). 2025 in Review: U.S. Billion-Dollar Disasters. https://www.climatecentral.org/climate-matters/2025-in-review
National Weather Service. Storm Data Preparation, NWS Directive 10-1605. National Oceanic and Atmospheric Administration. https://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf
NOAA National Centers for Environmental Information. Storm Events Database. https://www.ncei.noaa.gov/stormevents/