Image source: https://www.noaa.gov/noaa-collections/photo-library/con00002jpg


Introduction

These weather phenomena lead to billion-dollar damages and many deaths annually in the U.S. This research focuses on the Storm Events Database of NOAA, which provides information about weather incidents and is stored at the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA). This database is produced by the National Weather Service (NWS), which is a subdivision of NOAA. It is important to note that NWS is the organization that creates and stores these records, rather than NOAA or its website, and NCEI, which only maintains these files.

How the data was collected: According to the documentation from the NOAA/NWS, each entry in Storm Data is written by a Warning Coordination Meteorologist of a National Weather Service Forecast Office. He/She puts together reports of the storm obtained from multiple sources including trained spotters, law enforcement agencies, emergency management services, public reports, newspaper, and ASOS into the “StormDat” data entry system. The numbers regarding property and crop losses are reported as the best estimate figures and are not inflated. For more details see NOAA’s NWS Directive 10-1605 (https://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf).

Variables used in this project

Variable Type Description
event_type Categorical Type of severe weather event (e.g. Tornado, Flood, Hail)
state Categorical U.S. state where the event occurred
month_name Categorical Month the event began
cz_type Categorical County (C), Zone (Z), or Marine (M) designation
magnitude Quantitative Wind speed (knots) or hail size (inches), depending on event
injuries_direct Quantitative Direct injuries caused by the event
deaths_direct Quantitative Direct deaths caused by the event
damage_property Quantitative Estimated property damage, cleaned to numeric dollars
damage_crops Quantitative Estimated crop damage, cleaned to numeric dollars
begin_lat / begin_lon Quantitative Coordinates where the event began
begin_date_time Date/time When the event started

Questions explored: What are the types of extreme weather that do the most damage to property? Can the intensity of a storm help to predict the monetary damage done after taking into account the type of storm and its impact on people (injuries/deaths)?

Why this topic: This particular dataset was selected mainly due to the availability of numerous quantitative and qualitative variables, which are suitable for modeling and analysis in this project. Even though I am not personally involved in severe weather phenomena, the practical importance of this issue, tens of billion dollars of annual losses and direct impact on safety makes it an interesting area to analyze using regression and visualization.


Background Research

Costs of severe weather have significantly increased in the last decades in the United States. According to data on billion-dollar weather and climate disasters provided by Climate Central, there have been fourteen separate billion-dollar disasters in the first half of 2025 already, the total cost being over $101 billion. Moreover, on average there have been about three billion-dollar disasters annually in the 1980s but there have been nineteen such disasters annually in the last decade; average annual cost has more than quadrupled, too – from $22.6 billion in the 1980s to $102 billion in the 2010s. The database, which used to be managed by NOAA’s National Centers for Environmental Information, was terminated by the U.S. federal government in 2025, and is currently maintained by non-profit Climate Central to make sure that this record is uninterrupted. It is directly related to my analysis of NOAA Storm Events Database, as it means that costs of damage in more recent years in my dataset are going to be higher and more common than in earlier decades. Climate Central. (2025). 2025 in Review: U.S. Billion-Dollar Disasters. https://www.climatecentral.org/climate-matters/2025-in-review


Loading Libraries and Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(broom)
storms_raw <- read_csv("StormEvents_details-ftp_v1.0_d2024_c20260421.csv")
## Rows: 69801 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (25): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(storms_raw) <- tolower(names(storms_raw))
glimpse(storms_raw)
## Rows: 69,801
## Columns: 51
## $ begin_yearmonth    <dbl> 202404, 202407, 202411, 202405, 202405, 202405, 202…
## $ begin_day          <dbl> 30, 1, 16, 22, 21, 24, 1, 1, 14, 14, 17, 13, 17, 17…
## $ begin_time         <dbl> 2033, 0, 230, 1230, 1200, 1405, 0, 0, 1510, 1352, 1…
## $ end_yearmonth      <dbl> 202404, 202407, 202411, 202405, 202405, 202405, 202…
## $ end_day            <dbl> 30, 5, 18, 22, 21, 24, 1, 1, 14, 14, 18, 13, 18, 18…
## $ end_time           <dbl> 2033, 900, 1421, 1615, 1530, 1410, 1600, 1600, 1515…
## $ episode_id         <dbl> 189851, 193486, 197838, 191723, 191723, 191916, 197…
## $ event_id           <dbl> 1174463, 1195301, 1223377, 1184135, 1184133, 118234…
## $ state              <chr> "OKLAHOMA", "LOUISIANA", "OREGON", "TEXAS", "TEXAS"…
## $ state_fips         <dbl> 40, 22, 41, 48, 48, 28, 53, 41, 28, 47, 41, 53, 41,…
## $ year               <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ month_name         <chr> "April", "July", "November", "May", "May", "May", "…
## $ event_type         <chr> "Thunderstorm Wind", "Excessive Heat", "Heavy Snow"…
## $ cz_type            <chr> "C", "Z", "Z", "Z", "Z", "C", "Z", "Z", "C", "C", "…
## $ cz_fips            <dbl> 141, 18, 509, 254, 254, 115, 211, 127, 141, 71, 127…
## $ cz_name            <chr> "TILLMAN", "NATCHITOCHES", "EAST SLOPES OF THE OREG…
## $ wfo                <chr> "OUN", "SHV", "PDT", "BRO", "BRO", "MEG", "PQR", "P…
## $ begin_date_time    <chr> "30-APR-24 20:33:00", "01-JUL-24 00:00:00", "16-NOV…
## $ cz_timezone        <chr> "CST-6", "CST-6", "PST-8", "CST-6", "CST-6", "CST-6…
## $ end_date_time      <chr> "30-APR-24 20:33:00", "05-JUL-24 09:00:00", "18-NOV…
## $ injuries_direct    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ injuries_indirect  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deaths_direct      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deaths_indirect    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ damage_property    <chr> NA, "0.00K", "0.00K", "0.00K", "0.00K", "1.00K", "0…
## $ damage_crops       <chr> NA, "0.00K", "0.00K", "0.00K", "0.00K", "0.00K", "0…
## $ source             <chr> "ASOS", "ASOS", "SNOTEL", "ASOS", "ASOS", "Public",…
## $ magnitude          <dbl> 55.00, NA, NA, NA, NA, 52.00, NA, NA, 1.00, 0.88, N…
## $ magnitude_type     <chr> "MG", NA, NA, NA, NA, "EG", NA, NA, NA, NA, NA, "MG…
## $ flood_cause        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ category           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_f_scale        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_length         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_width          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_wfo      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_cz_state <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_cz_fips  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tor_other_cz_name  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ begin_range        <dbl> 0.42, NA, NA, NA, NA, 0.00, NA, NA, 0.90, 1.49, NA,…
## $ begin_azimuth      <chr> "SSW", NA, NA, NA, NA, "N", NA, NA, "NW", "SSE", NA…
## $ begin_location     <chr> "FREDERICK ARPT", NA, NA, NA, NA, "ALGOMA", NA, NA,…
## $ end_range          <dbl> 0.42, NA, NA, NA, NA, 0.00, NA, NA, 0.90, 1.49, NA,…
## $ end_azimuth        <chr> "SSW", NA, NA, NA, NA, "N", NA, NA, "NW", "SSE", NA…
## $ end_location       <chr> "FREDERICK ARPT", NA, NA, NA, NA, "ALGOMA", NA, NA,…
## $ begin_lat          <dbl> 34.3444, NA, NA, NA, NA, 34.1800, NA, NA, 34.5100, …
## $ begin_lon          <dbl> -98.9830, NA, NA, NA, NA, -89.0300, NA, NA, -88.210…
## $ end_lat            <dbl> 34.3444, NA, NA, NA, NA, 34.1800, NA, NA, 34.5100, …
## $ end_lon            <dbl> -98.9830, NA, NA, NA, NA, -89.0300, NA, NA, -88.210…
## $ episode_narrative  <chr> "A rather nebulous upper air pattern existed across…
## $ event_narrative    <chr> "Frederick Municipal Airport (KFDR) observation.", …
## $ data_source        <chr> "CSV", "CSV", "CSV", "CSV", "CSV", "CSV", "CSV", "C…


Data Cleaning and Wrangling

Parsing the damage columns

The damage_property and damage_crops fields are stored as text like "10.00K" or "1.50M" rather than numbers, so I first converted them to true numeric dollar amounts.

parse_damage <- function(x) {
  x <- toupper(trimws(x))
  mult <- case_when(
    str_detect(x, "K$") ~ 1e3,
    str_detect(x, "M$") ~ 1e6,
    str_detect(x, "B$") ~ 1e9,
    TRUE ~ 1
  )
  num <- as.numeric(str_remove(x, "[KMB]$"))
  num * mult
}
parse_damage
## function (x) 
## {
##     x <- toupper(trimws(x))
##     mult <- case_when(str_detect(x, "K$") ~ 1000, str_detect(x, 
##         "M$") ~ 1e+06, str_detect(x, "B$") ~ 1e+09, TRUE ~ 1)
##     num <- as.numeric(str_remove(x, "[KMB]$"))
##     num * mult
## }
storms_raw <- storms_raw %>%
  mutate(
    damage_property_num = parse_damage(damage_property),
    damage_crops_num   = parse_damage(damage_crops)
  )
storms_raw
## # A tibble: 69,801 × 53
##    begin_yearmonth begin_day begin_time end_yearmonth end_day end_time
##              <dbl>     <dbl>      <dbl>         <dbl>   <dbl>    <dbl>
##  1          202404        30       2033        202404      30     2033
##  2          202407         1          0        202407       5      900
##  3          202411        16        230        202411      18     1421
##  4          202405        22       1230        202405      22     1615
##  5          202405        21       1200        202405      21     1530
##  6          202405        24       1405        202405      24     1410
##  7          202411         1          0        202411       1     1600
##  8          202411         1          0        202411       1     1600
##  9          202405        14       1510        202405      14     1515
## 10          202405        14       1352        202405      14     1357
## # ℹ 69,791 more rows
## # ℹ 47 more variables: episode_id <dbl>, event_id <dbl>, state <chr>,
## #   state_fips <dbl>, year <dbl>, month_name <chr>, event_type <chr>,
## #   cz_type <chr>, cz_fips <dbl>, cz_name <chr>, wfo <chr>,
## #   begin_date_time <chr>, cz_timezone <chr>, end_date_time <chr>,
## #   injuries_direct <dbl>, injuries_indirect <dbl>, deaths_direct <dbl>,
## #   deaths_indirect <dbl>, damage_property <chr>, damage_crops <chr>, …
storms_clean <- storms_raw %>%
  select(event_type, state, magnitude, injuries_direct, deaths_direct,
         damage_property_num, damage_crops_num, begin_lat, begin_lon) %>%
  filter(!is.na(magnitude), magnitude > 0) %>%
  mutate(total_damage = damage_property_num + damage_crops_num,
         log_damage = log1p(total_damage)) %>%
  arrange(desc(total_damage))
storms_clean
## # A tibble: 36,467 × 11
##    event_type  state magnitude injuries_direct deaths_direct damage_property_num
##    <chr>       <chr>     <dbl>           <dbl>         <dbl>               <dbl>
##  1 Hail        NEBR…      2.75               0             0            25000000
##  2 High Wind   VIRG…     56                  3             0            15500000
##  3 Thundersto… TEXAS     70                  0             0            70000000
##  4 High Wind   VIRG…     50                  0             0            13800000
##  5 Thundersto… TEXAS    104                 13             0            35000000
##  6 High Wind   VIRG…     50                  0             0            17000000
##  7 Hail        NEBR…      4                  0             0            25000000
##  8 Thundersto… ARIZ…     66                  0             0            25000000
##  9 Hail        MISS…      3                  0             0            20000000
## 10 Thundersto… ARKA…     83                  0             0            20000000
## # ℹ 36,457 more rows
## # ℹ 5 more variables: damage_crops_num <dbl>, begin_lat <dbl>, begin_lon <dbl>,
## #   total_damage <dbl>, log_damage <dbl>

Summarizing damage by event type

damage_by_type <- storms_clean %>%
  group_by(event_type) %>%
  summarize(                             
    n_events       = n(),
    total_damage   = sum(total_damage, na.rm = TRUE),
    mean_injuries  = mean(injuries_direct, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_damage))

damage_by_type
## # A tibble: 8 × 4
##   event_type               n_events total_damage mean_injuries
##   <chr>                       <int>        <dbl>         <dbl>
## 1 Thunderstorm Wind           19940    364444500       0.00727
## 2 Hail                         8941    341407130       0.00168
## 3 High Wind                    3994    180328300       0.00501
## 4 Strong Wind                  1311      4626580       0.0153 
## 5 Marine Thunderstorm Wind     2192        10500       0      
## 6 Marine Strong Wind              3         1000       1      
## 7 Marine Hail                    22            0       0      
## 8 Marine High Wind               64            0       0

The most damaging events are the Thunderstorm Winds ($364 million) and Hail ($341 million) within the 2024 storms recorded in this database, totaling more than 75% of the total amount of damages of the highest ranked events. One would expect that there were many injuries caused by such events, but in reality the mean number of injuries per event is quite low (below 0.02).


Statistical Analysis: Multiple Linear Regression

We model property + crop damage (log-transformed) as a function of storm magnitude, direct injuries, and direct deaths. A log transform is used because raw damage dollars are extremely right-skewed (a small number of catastrophic events dominate).

model <- lm(log_damage ~ magnitude + injuries_direct + deaths_direct,
            data = storms_clean)
summary(model)
## 
## Call:
## lm(formula = log_damage ~ magnitude + injuries_direct + deaths_direct, 
##     data = storms_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.206  -3.353  -1.326   4.000  17.035 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.276215   0.050219  25.413  < 2e-16 ***
## magnitude       0.039941   0.001075  37.163  < 2e-16 ***
## injuries_direct 0.814951   0.165355   4.928 8.34e-07 ***
## deaths_direct   1.783461   0.491317   3.630 0.000284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.992 on 25787 degrees of freedom
##   (10676 observations deleted due to missingness)
## Multiple R-squared:  0.05288,    Adjusted R-squared:  0.05277 
## F-statistic:   480 on 3 and 25787 DF,  p-value: < 2.2e-16

Model equation:

\[Y = \beta_0 + \beta_1(X_1) + \beta_2(X_2) + \epsilon\]

tidy(model) %>%
  mutate(across(where(is.numeric), ~round(., 4)))
## # A tibble: 4 × 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)       1.28      0.0502     25.4   0     
## 2 magnitude         0.0399    0.0011     37.2   0     
## 3 injuries_direct   0.815     0.165       4.93  0     
## 4 deaths_direct     1.78      0.491       3.63  0.0003
glance(model) %>%
  select(r.squared, adj.r.squared, p.value)
## # A tibble: 1 × 3
##   r.squared adj.r.squared   p.value
##       <dbl>         <dbl>     <dbl>
## 1    0.0529        0.0528 1.65e-303

Diagnostic plots

par(mfrow = c(2, 2))
plot(model)

Interpretation: The three predictors, all at p < 0.001 level of significance, are: magnitude (p < 2e-16), direct injuries (p = 8.34e-07), and direct deaths (p = 0.000284). A one unit increase in magnitude increases damage by 4%, but a unit increase in injuries/deaths increases the damage to a greater extent, deaths in particular have the greatest influence with coefficient of 1.78. But the adjusted R² is just 0.053. This means that the three variables mentioned above account for only 5% of the variation in the log damage. Thus, it can be concluded that even though the magnitude, injuries and deaths are important predictors, more than 95% of the variation in storm damage is due to some other factors that have not been included in this model – and those must include event types, geographic location and population of the area affected by the storms. The residual plots indicate funnel shape pattern for Residuals vs Fitted plot, indicating that despite log transformation, there is still heteroskedasticity.


Visualization 1: Total Damage by Event Type

damage_by_type <- storms_clean %>%
  group_by(event_type) %>%
  summarize(total_damage = sum(total_damage, na.rm = TRUE)) %>%
  slice_max(total_damage, n = 10)

p1 <- ggplot(damage_by_type, aes(x = reorder(event_type, total_damage),
                                 y = total_damage, fill = event_type)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Top 10 Severe Weather Event Types by Total Damage (2024)",
       x = "Event Type", y = "Total Damage (USD)",
       caption = "Source: NOAA National Weather Service, Storm Events Database") +
  theme_minimal()

ggplotly(p1)


Visualization 2: Geographic Distribution of Severe Events

p2 <- ggplot(storms_clean, aes(x = begin_lon, y = begin_lat,
                                color = event_type, size = magnitude)) +
  geom_point(alpha = 0.5) +
  scale_color_brewer(palette = "Set2") +               # non-default palette
  labs(
    title    = "Geographic Distribution of Severe Weather Events",
    subtitle = "Point size reflects reported storm magnitude",
    x        = "Longitude",
    y        = "Latitude",
    color    = "Event Type",
    size     = "Magnitude",
    caption  = "Source: NOAA National Weather Service, Storm Events Database (NCEI)"
  ) +
  theme_dark(base_size = 13) +                        
  annotate("text", x = -100, y = 25,
           label = "Gulf Coast events tend to run larger",
           color = "white", fontface = "italic", size = 3.2)

p2
## Warning: Removed 5305 rows containing missing values or values outside the scale range
## (`geom_point()`).


Essay: Reflection

What the visualizations represent

Visualization 1 demonstrates that “Thunderstorm Wind” and “Hail” are the two most costly event types for 2024, as “Thunderstorm Wind” causes more than $364 million worth of damages alone. Visualization 2 depicts the geographical distribution of extreme weather events in the continental United States, and a cluster of large magnitude events can be seen on the Gulf Coast Region.

Interesting patterns or surprises

Another unexpected finding was how poor the explanatory strength of the regression model was (adjusted R² of just 0.053) even though each independent variable was statistically significant — showing that there is a difference between statistical significance and explanatory strength. Additionally, I thought that injuries and deaths would be more highly correlated with damage than they turned out to be in terms of the coefficients.

What you wish you could have included

Given additional time, it would have been nice to include event_type as a factor variable in the regression equation because Visualization 1 makes it apparent that it probably explains much more of the variability in damage than just magnitude does. It would have also been nice to look at 2024 against a time series to find out if the trend seen in the literature still exists in this dataset.


References

Climate Central. (2025). 2025 in Review: U.S. Billion-Dollar Disasters. https://www.climatecentral.org/climate-matters/2025-in-review

National Weather Service. Storm Data Preparation, NWS Directive 10-1605. National Oceanic and Atmospheric Administration. https://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf

NOAA National Centers for Environmental Information. Storm Events Database. https://www.ncei.noaa.gov/stormevents/