1 Synopsis

The data set contained information regarding major weather events across the U.S., starting in 1950 and ending in November 2011. It was sourced from the NOAA storm database, comprising 902,297 records across 37 variables in CSV format compressed via the bzip2 algorithm.

The project set out to identify the weather events that were most harmful to population health and those that had the greatest economic consequences.

Since the data set was extensive, the project utilized the data.table package for both data acquisition and preprocessing, while the results were produced in the tidyverse.

Data profiling revealed inconsistent categorical labels, motivating a structured preprocessing pipeline.

The preprocessing pipeline consisted of many steps. First, dates before Jan 1st, 1996 were filtered out, because NOAA didn’t begin recording all weather event types until that date. Second, str_extract() was used for automatic and bulk transformation of most unique values of the “event_type” column,and then, fcase() was employed to manually gather the remaining unique values into groups of weather events, while data entry errors were filtered out. Third, two derived columns were produced holding the total property and agricultural damages, after recoding their metric prefixes columns. Last, summary statistics were calculated for each unique weather event type.

Regarding the health consequences, tornadoes had the most devastating effect, with a total number of 22183 fatalities and injuries. Floods and heat followed with half the previous count, while thunderstorm winds, lightning, and winter storms were also dangerous. Finally, heat was responsible for the most deaths out of all event types, with tornadoes and floods following.

With regard to the economic consequences, floods were undoubtedly the most catastrophic, resulting in a total of 170.8 billion dollars in damages. Moreover, hurricanes were also calamitous, leading to a financial loss of 95.4 billion dollars, while tornadoes and hail followed. Last, drought was the most destructive agriculturally, causing $13.4B in crop damages.

2 Data Acquisition

library(tidyverse)
library(data.table)

The “repdata_data_StormData.csv” file holds 902,297 rows and 37 columns. Hence, data.table::fread() will be used to read in the data, which is faster than read.csv() and read_csv().

dt <- fread(cmd = "bzip2 -dc repdata_data_StormData.csv.bz2", 
            header = T, sep = ",", showProgress = F)

3 Data Profiling

The inspection will be held through the prism of the relevant columns for this project:

  • BGN_DATE, which represents the day the weather event started.
  • EVTYPE, which represents the type of the weather event.
  • FATALITIES & INJURIES, which are the counts of direct human deaths and injuries attributed to the event, respectively.
  • PROPDMG & PROPDMGEXP, which are the estimated property damage and its exponent, respectively.
  • CROPDMG & CROPDMGEXP, which are the estimated crop damage and its exponent, respectively.

Important Note:

  • As explained in the Storm Data FAQ Page (Section:“How are tornadoes counted?”), the health and economic consequences for multi-county events (major weather events crossing many counties) are segmented. In other words, when an event crosses many counties the consequences of that event are spread across the respective rows of the different counties the event crosses. That way, all multi-county events are treated as separate events that occur in different counties, where each row that belongs to the same multi-county event carries its own consequences (instead of recycled aggregates representing the total amounts). Hence, during the preprocessing stage, grouping and aggregation will occur so that the overall impact of each event type can be calculated.
relevant_column_names <- c("BGN_DATE","EVTYPE", "FATALITIES", "INJURIES", 
                           "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

Transposed summary of the data.table:

glimpse(dt[, ..relevant_column_names])
## Rows: 902,297
## Columns: 8
## $ BGN_DATE   <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0~
## $ EVTYPE     <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR~
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,~
## $ INJURIES   <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, ~
## $ PROPDMG    <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, ~
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",~
## $ CROPDMG    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ CROPDMGEXP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",~

NAs:

naniar::miss_var_summary(dt[, ..relevant_column_names])
## # A tibble: 8 x 3
##   variable   n_miss pct_miss
##   <chr>       <int>    <num>
## 1 BGN_DATE        0        0
## 2 EVTYPE          0        0
## 3 FATALITIES      0        0
## 4 INJURIES        0        0
## 5 PROPDMG         0        0
## 6 PROPDMGEXP      0        0
## 7 CROPDMG         0        0
## 8 CROPDMGEXP      0        0

The 8 relevant columns do not contain any NAs.

Check for the existence of duplicate records:

suppressMessages(janitor::get_dupes(dt))
## Empty data.table (0 rows and 38 cols): STATE__,BGN_DATE,BGN_TIME,TIME_ZONE,COUNTY,COUNTYNAME...

There are no duplicates in “dt”.

Data summary by relevant variable:

skimr::skim_without_charts(dt[, ..relevant_column_names])
Data summary
Name dt[, ..relevant_column_na…
Number of rows 902297
Number of columns 8
Key NULL
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BGN_DATE 0 1 16 18 0 16335 0
EVTYPE 0 1 1 30 0 985 0
PROPDMGEXP 0 1 0 1 465934 19 0
CROPDMGEXP 0 1 0 1 618413 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
FATALITIES 0 1 0.02 0.77 0 0 0 0.0 583
INJURIES 0 1 0.16 5.43 0 0 0 0.0 1700
PROPDMG 0 1 12.06 59.48 0 0 0 0.5 5000
CROPDMG 0 1 1.53 22.17 0 0 0 0.0 990

Observations from the summary statistics:

  • The “EVTYPE” column holds 985 unique values, whereas the number of major weather events is significantly smaller (48). This points to naming inconsistencies.
  • The “PROPDMGEXP” and “CROPDMGEXP” columns also seem to have more unique values than what would be expected (that is, only “H” for hundreds, “K” for thousands, “M” for millions, and “B” for billions of U.S. dollars).
  • All four numeric columns follow a zero-inflated distribution, that is, a distribution that has a substantial spike at zero and a long right tail of non-zero values.

Unique values of the “PROPDMGEXP” column:

unique(dt$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

Unique values of the “CROPDMGEXP” column:

unique(dt$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

4 Data Preprocessing

4.1 Description of the Preprocessing Pipeline

Steps in the Preprocessing Pipeline:

  1. Rename the Columns
  2. Preprocess the “BGN_DATE” column
    • Convert to Date
    • Filter out records before 01/01/1996
  3. Clean the “EVTYPE” Column
    • Remove trailing and leading whitespaces
    • Convert to title case
    • Use str_extract() for automatic and bulk transformation of most unique values of the column.
    • Use fcase() to manually group the unique values from the output of str_extract().
    • Filter out unique values associated with data entry errors or general terms.
  4. Create two derived columns holding the total property and agricultural damages
    • Multiply the damages by their multiplier, which is calculated by applying the “exponent_to_multiplier” function to the “PROPDMGEXP” and “CROPDMGEXP” columns.
  5. Remove the NAs step 4. produced.
  6. Group by the “EVTYPE” column and calculate summary statistics.

4.2 Note regarding the “BGN_DATE” column

Why filter out records before 01/01/1996?

  • Because NOAA didn’t begin recording all weather event types until that date.
  • More specifically, before 1996, only Tornado, Thunderstorm Wind, and Hail data were collected.
  • The recording of all other weather events presented in the storm data event table (p.6) began after 1996.
  • Considering that, including dates before 1996 could potentially skew the analysis toward these event types.

4.3 Supporting Objects

4.4 Preprocessing Pipeline

setnames(dt, relevant_column_names, 
         c("begin_date","event_type", "fatalities", "injuries", "property_damage", 
           "property_damage_exponent", "crop_damage", "crop_damage_exponent")
         )

invisible(dt[ 
        , `:=`(
                begin_date = as_date(mdy_hms(begin_date)),
                event_type = str_to_title(str_trim(event_type, side = "both"))
        )
])
results <- dt[
        begin_date >= mdy("01/01/1996")
][
        , event_type := str_extract(event_type, regex(pattern, ignore_case = T))
][
        , event_type := recode_event_type(event_type)
][
        !(event_type %chin% filter_out_vals)
][
        , `:=`(
                total_property_damage = 
                        property_damage * 
                        exponent_to_multiplier(property_damage_exponent),
                total_crop_damage = 
                        crop_damage * 
                        exponent_to_multiplier(crop_damage_exponent)
        )
][
        , .(
                count = .N,
                fatalities_per_type = sum(fatalities),
                injuries_per_type = sum(injuries),
                total_property_damage_per_type = sum(total_property_damage),
                total_crop_damage_per_type = sum(total_crop_damage)
        ), by = event_type
][
        , `:=`(
                health_consequences_per_type = fatalities_per_type + injuries_per_type,
                damages_per_type = 
                        total_property_damage_per_type + total_crop_damage_per_type 
        )
][
        order(-health_consequences_per_type, -damages_per_type)
]

4.5 Validation

Check whether the transformation of the “EVTYPE/”event_type” column was successful. There should only exist 20 unique values (levels later) of that column, all stemming from the application of the “recode_event_type” recoding function in the previous preprocessing pipeline.

unique(results$event_type)
##  [1] "Tornado"           "Flood"             "Heat"             
##  [4] "Thunderstorm Wind" "Lightning"         "Winter Storm"     
##  [7] "Wind"              "Hurricane/Typhoon" "Marine/Coastal"   
## [10] "Fire"              "Fog"               "Hail"             
## [13] "Cold"              "Dust Storm"        "Avalanche"        
## [16] "Heavy Rain"        "Landslide"         "Drought"          
## [19] "Volcanic Ash"      "Smoke"
glimpse(results)
## Rows: 20
## Columns: 8
## $ event_type                     <chr> "Tornado", "Flood", "Heat", "Thundersto~
## $ count                          <int> 32618, 79727, 2410, 217013, 13204, 3967~
## $ fatalities_per_type            <dbl> 1513, 1348, 2035, 389, 651, 530, 401, 1~
## $ injuries_per_type              <dbl> 20670, 8525, 7685, 5156, 4141, 3591, 15~
## $ total_property_damage_per_type <dbl> 24622817010, 164399831670, 9243700, 791~
## $ total_crop_damage_per_type     <dbl> 283425010, 6383043200, 492578500, 10169~
## $ health_consequences_per_type   <dbl> 22183, 9873, 9720, 5545, 4792, 4121, 19~
## $ damages_per_type               <dbl> 24906242020, 170782874870, 501822200, 8~
if_else(all(naniar::miss_var_summary(results)$n_miss) == 0, "No Missing Values", "NAs Present")
## [1] "No Missing Values"
if_else(nrow(suppressMessages(janitor::get_dupes(results))) == 0, "No Duplicates", "Duplicates Present") 
## [1] "No Duplicates"

5 Results

Convert the “results” data.table to tibble to facilitate further exploration, since “results” now only contains 20 rows and 8 columns:

results <- results |> as_tibble()

5.1 Health Consequences

Create the tibble that will hold the data for all health consequences:

health_consequences <- results |> 
        select(event_type, count, fatalities_per_type, injuries_per_type, 
               health_consequences_per_type) |> 
        arrange(desc(health_consequences_per_type)) |> 
        mutate(event_type = as_factor(event_type))
health_consequences
## # A tibble: 20 x 5
##    event_type         count fatalities_per_type injuries_per_type
##    <fct>              <int>               <dbl>             <dbl>
##  1 Tornado            32618                1513             20670
##  2 Flood              79727                1348              8525
##  3 Heat                2410                2035              7685
##  4 Thunderstorm Wind 217013                 389              5156
##  5 Lightning          13204                 651              4141
##  6 Winter Storm       39679                 530              3591
##  7 Wind               24547                 401              1506
##  8 Hurricane/Typhoon   1015                 182              1666
##  9 Marine/Coastal      8432                 740               906
## 10 Fire                4176                  87              1458
## 11 Fog                 1774                  69               855
## 12 Hail              208210                   7               723
## 13 Cold                3727                 380               130
## 14 Dust Storm           559                  13               415
## 15 Avalanche            378                 223               156
## 16 Heavy Rain         11859                 102               249
## 17 Landslide            620                  43                55
## 18 Drought             2634                   3                29
## 19 Volcanic Ash          30                   0                 0
## 20 Smoke                 21                   0                 0
## # i 1 more variable: health_consequences_per_type <dbl>
health_consequences |> 
        select(-count) |> 
        slice_head(n = 10) |> 
        pivot_longer(cols = c(fatalities_per_type, injuries_per_type), 
                     names_to = "component", 
                     values_to = "value") |> 
        mutate(component = factor(component, 
                                  levels = c("fatalities_per_type", "injuries_per_type"))) |> 
                                             # flipped in a horizontal bar chart
        ggplot(aes(x = value, y = fct_rev(event_type), fill = component)) +
        geom_bar(stat = "identity", position = "stack", width = 0.8) +
        geom_text(
                data = health_consequences |> slice_head(n = 10), # Original wide data (1 row per bar)
                aes(
                        x = health_consequences_per_type, 
                        y = fct_rev(event_type),
                        label = health_consequences_per_type,
                ),
                inherit.aes = F, # detach from parent aes() so fill & component don't interfere 
                size = 3.5, nudge_x = 650
        ) + 
        scale_x_continuous(expand = expansion(mult = c(0.01, 0.1)),
                           breaks = seq(from = 0, to = 25000, by = 2500)) +
        scale_fill_manual(name = "Harm Inflicted:",
                          values = c("fatalities_per_type" = "#0072B2", 
                                     "injuries_per_type" = "#D55E00"), # Blue & Vermillion
                          labels = c("fatalities_per_type" = "Fatalities", 
                                     "injuries_per_type" = "Injuries")
        ) + 
        labs(x = "Count of Harmed Individuals (Fatalities & Injuries)", y = "Weather Event Type") +
        theme_bw(base_size = 12) + 
        theme(
                axis.text = element_text(size = 12),
                legend.position = c(0.865,0.17),
                legend.background = element_blank(),
                legend.key = element_blank(),
                plot.margin = margin(6, 16, 6, 8),
                panel.grid.major.y = element_blank()
        )
Weather event types that are most harmful to population health across the U.S. from 1996-01-01 to 2011-11-30.

Weather event types that are most harmful to population health across the U.S. from 1996-01-01 to 2011-11-30.

Health Consequences Insights:

  • Tornadoes top the list of the most health consequences with a total number of 22183 fatalities and injuries.
  • Floods and heat follow with half the count of the “Tornado” event type with regard to fatalities and injuries. (9873 and 9720, respectively).
  • Thunderstorm winds, lightning, and winter storms are also dangerous, having harmed 5545, 4792, and 4121 people, respectively.
  • Heat, even though it is observed at least one order of magnitude less often (2410 records) than the other event types that top the list, is responsible for the most deaths out of all event types (2035); Tornadoes and Floods follow (1513 and 1348, respectively).

5.2 Economic Consequences

Create the tibble that will hold the data for all economic consequences:

economic_consequences <- results |> 
        select(event_type, count, total_property_damage_per_type, 
               total_crop_damage_per_type, damages_per_type) |> 
        mutate(
                total_property_damage_per_type = total_property_damage_per_type/1e9,
                total_crop_damage_per_type = total_crop_damage_per_type/1e9,
                damages_per_type = damages_per_type/1e9
                ) |> 
        arrange(desc(damages_per_type)) |> 
        mutate(event_type = as_factor(event_type))
economic_consequences
## # A tibble: 20 x 5
##    event_type         count total_property_damage_per_t~1 total_crop_damage_pe~2
##    <fct>              <int>                         <dbl>                  <dbl>
##  1 Flood              79727                     164.                     6.38   
##  2 Hurricane/Typhoon   1015                      89.4                    6.03   
##  3 Tornado            32618                      24.6                    0.283  
##  4 Hail              208210                      14.6                    2.50   
##  5 Drought             2634                       1.05                  13.4    
##  6 Thunderstorm Wind 217013                       7.91                   1.02   
##  7 Fire                4176                       7.76                   0.402  
##  8 Winter Storm       39679                       6.41                   0.121  
##  9 Wind               24547                       5.43                   0.716  
## 10 Cold                3727                       0.0417                 2.64   
## 11 Heavy Rain         11859                       0.600                  0.730  
## 12 Lightning          13204                       0.743                  0.00690
## 13 Heat                2410                       0.00924                0.493  
## 14 Landslide            620                       0.327                  0.0200 
## 15 Marine/Coastal      8432                       0.264                  0.0402 
## 16 Fog                 1774                       0.0226                 0      
## 17 Dust Storm           559                       0.00616                0.0031 
## 18 Avalanche            378                       0.00371                0      
## 19 Volcanic Ash          30                       0.0005                 0      
## 20 Smoke                 21                       0.0001                 0      
## # i abbreviated names: 1: total_property_damage_per_type,
## #   2: total_crop_damage_per_type
## # i 1 more variable: damages_per_type <dbl>
wide_economic_consequences <- economic_consequences |> slice_head(n = 10)

long_economic_consequences <- economic_consequences |> 
        select(-count) |> 
        slice_head(n = 10) |> 
        pivot_longer(cols = c(total_property_damage_per_type, total_crop_damage_per_type), 
                     names_to = "component",
                     values_to = "value") |> 
        mutate(component = factor(component, 
                                  levels = c("total_crop_damage_per_type", 
                                             "total_property_damage_per_type")))

long_economic_consequences |> 
        ggplot(aes(x = value, y = fct_rev(event_type), fill = component)) +
        geom_bar(stat = "identity", position = "stack", width = 0.8) +
        geom_text(
                data = wide_economic_consequences,
                aes(
                        x = damages_per_type,
                        y = fct_rev(event_type),
                        label = scales::dollar(damages_per_type, 
                                               accuracy = 0.1, suffix = "B")
                ),
                inherit.aes = F, size = 3.5, nudge_x = 6
        ) + 
        scale_x_continuous(expand = expansion(mult = c(0.01, 0.1)),
                           breaks = seq(from = 0, to = 180, by = 20),
                           labels = scales::label_dollar(suffix = "B")) +
        scale_fill_manual(name = "Economic Consequence:",
                          values = c("total_crop_damage_per_type" = "#0072B2", 
                                     "total_property_damage_per_type" = "#D55E00"),
                          labels = c("total_crop_damage_per_type" = "Crop Damage", 
                                     "total_property_damage_per_type" = "Property Damage")
        ) + 
        labs(x = "U.S. Dollars (Billions)", y = "Weather Event Type") +
        theme_bw(base_size = 12) + 
        theme(
                axis.text = element_text(size = 12),
                legend.position = c(0.832,0.17),
                legend.background = element_blank(),
                legend.key = element_blank(),
                plot.margin = margin(6, 15, 6, 8),
                panel.grid.major.y = element_blank()
        )
Weather event types with the greatest economic consequences across the U.S. from 1996-01-01 to 2011-11-30.

Weather event types with the greatest economic consequences across the U.S. from 1996-01-01 to 2011-11-30.

Economic Consequences Insights:

  • Floods are unquestionably the most harmful financially, causing a total of 170.8 billion dollars in damages (property and crop damages).
    • More specifically, the “Flood” type is second regarding the total crop damages it causes ($6.4B) and first with regard to the total property damages it is responsible for ($164.4B).
  • Hurricanes, even though they appear infrequently in the database (1015 records), are second in the list, resulting in 95.4 billion dollars of total damages.
  • The previous weather events are followed by tornadoes ($24.9B) and hail ($17.1B), and drought ($14.4B).
  • Even though drought is observed less often (2634 records) and it is the third least harmful event type for population health, it is the fifth most harmful financially, and it constitutes the most detrimental weather event type with regard to crop damages ($13.4B in crop damages).

6 Software Environment

sessionInfo()
## R version 4.5.3 (2026-03-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## time zone: Europe/Athens
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] data.table_1.18.2.1 lubridate_1.9.5     forcats_1.0.1      
##  [4] stringr_1.6.0       dplyr_1.2.0         purrr_1.2.1        
##  [7] readr_2.1.6         tidyr_1.3.2         tibble_3.3.1       
## [10] ggplot2_4.0.2       tidyverse_2.0.0    
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.3     visdat_0.6.0      
##  [5] tidyselect_1.2.1   snakecase_0.11.1   jquerylib_0.1.4    scales_1.4.0      
##  [9] yaml_2.3.12        fastmap_1.2.0      R6_2.6.1           generics_0.1.4    
## [13] knitr_1.51         janitor_2.2.1      bslib_0.10.0       pillar_1.11.1     
## [17] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        utf8_1.2.6        
## [21] stringi_1.8.7      cachem_1.1.0       xfun_0.56          sass_0.4.10       
## [25] S7_0.2.1           otel_0.2.0         timechange_0.4.0   cli_3.6.5         
## [29] withr_3.0.2        magrittr_2.0.4     digest_0.6.39      grid_4.5.3        
## [33] rstudioapi_0.18.0  hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1       
## [37] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       naniar_1.1.0      
## [41] rmarkdown_2.31     tools_4.5.3        pkgconfig_2.0.3    htmltools_0.5.9