1 Synopsis

This analysis evaluates the impact of severe weather events on population health and economic damage, using data from the NOAA Storm Database for 1990-2011. Tornadoes cause by far the most injuries (26,767), while Excessive Heat causes the most fatalities (3,138), ahead of Thunderstorm Wind (2,092) and Tornadoes (1,758). Thunderstorm Wind leads property damage (about 3.74 million in summed PROPDMG values) and Hail leads crop damage (about 581 thousand in summed CROPDMG values). These sums use the recorded values without the PROPDMGEXP and CROPDMGEXP multipliers, so they are best read as relative rather than absolute dollar amounts. Floods emerge as a dual threat: they rank fourth in both injuries (8,683) and fatalities (1,553) and second in both property damage (about 2.46 million) and crop damage (about 367 thousand). These findings highlight the diverse impacts of different weather events and can inform resource prioritization.

2 Data Preprocessing

First, we read the dataset and list its column names to identify which metrics we’ll be working with:

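# Packages used throughout this analysis
library(tidyverse)   # dplyr, tidyr, ggplot2, stringr, tibble
library(lubridate)   # mdy_hms(), year()
library(tidytext)    # unnest_tokens()
library(scales)      # dollar(), comma()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec()

storm_data <- read.csv("StormData.csv.bz2")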

names(storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

To ensure accuracy in the analysis, the dataset requires preprocessing. The weather event type (EVTYPE) column contains many inconsistencies in naming and formatting, along with misspellings. For example, here’s what happens when we pull every unique value that appears in the EVTYPE column:

unique_evtype <- storm_data %>%
  distinct(EVTYPE) %>%
  pull(EVTYPE)

summary(unique_evtype)
##    Length     Class      Mode 
##       985 character character

Wow! Across the 902,297 rows in the dataset, there are 985 unique names for weather event types. For a deeper look, let’s count how often each word occurs in unique_evtype:

word_counts <- tibble(event_type = unique_evtype) %>%
  unnest_tokens(word, event_type) %>%
  count(word, sort = TRUE)

print(word_counts, n = 20)
## # A tibble: 444 × 2
##    word             n
##    <chr>        <int>
##  1 wind           138
##  2 snow           119
##  3 winds           93
##  4 heavy           83
##  5 flood           78
##  6 high            77
##  7 thunderstorm    74
##  8 summary         67
##  9 rain            65
## 10 and             56
## 11 hail            44
## 12 of              44
## 13 cold            41
## 14 record          41
## 15 flooding        37
## 16 ice             35
## 17 urban           33
## 18 storm           32
## 19 freezing        29
## 20 tstm            29
## # ℹ 424 more rows

Now we’re getting somewhere! As you can see, there are a lot of similarities among the weather event types in unique_evtype. That means many of these unique values can be merged, so that our calculations of each weather event’s impact on human health and the economy are more accurate. Let’s begin to group similar weather events using the word_counts tibble that we just created:

wind_related <- unique_evtype[str_detect(unique_evtype, regex("wind|tstm|thunderstorm|winds|thunderstorms|lightning|thunderstrom", ignore_case = TRUE))]
flood_related <- unique_evtype[str_detect(unique_evtype, regex("flood|flash|flooding|stream", ignore_case = TRUE))]
winter_related <- unique_evtype[str_detect(unique_evtype, regex("snow|blizzard|ice|winter|cold|freezing|frost|freeze|snowfall", ignore_case = TRUE))]
heat_related <- unique_evtype[str_detect(unique_evtype, regex("hot|heat", ignore_case = TRUE))]
tornado_related <- unique_evtype[str_detect(unique_evtype, regex("tornado|tornadoes|funnel|funnels|waterspout", ignore_case = TRUE))]
hail_related <- unique_evtype[str_detect(unique_evtype, regex("hail", ignore_case = TRUE))]
hurricane_related <- unique_evtype[str_detect(unique_evtype, regex("hurricane|typhoon", ignore_case = TRUE))]
landslide_related <- unique_evtype[str_detect(unique_evtype, regex("landslide|landslides|landslump|mud|mudslide|mudslides", ignore_case = TRUE))]
rain_related <- unique_evtype[str_detect(unique_evtype, regex("rain|rainfall|rains|drizzle", ignore_case = TRUE))]
drought_related <- unique_evtype[str_detect(unique_evtype, regex("drought|dryness|dry", ignore_case = TRUE))]

storm_data <- storm_data %>%
  mutate(event_type_clean = case_when(
    EVTYPE %in% wind_related ~ "Thunderstorm Wind",
    EVTYPE %in% flood_related ~ "Flood",
    EVTYPE %in% winter_related ~ "Winter Storm",
    EVTYPE %in% heat_related ~ "Excessive Heat",
    EVTYPE %in% tornado_related ~ "Tornado",
    EVTYPE %in% hail_related ~ "Hail",
    EVTYPE %in% hurricane_related ~ "Hurricane/Typhoon",
    EVTYPE %in% landslide_related ~ "Landslide",
    EVTYPE %in% rain_related ~ "Heavy Rain",
    EVTYPE %in% drought_related ~ "Drought",
    TRUE ~ "Other"
  ))
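One consequence of this approach is worth keeping in mind: case_when() assigns each row to the first condition that matches, so a label containing keywords from several groups goes to whichever group is listed first, and plain substring patterns can match labels that belong to a different kind of event. The base R sketch below illustrates both effects on a few illustrative labels, using a reduced version of the pattern list. The helper name first_match and the reduced pattern set are assumptions made for this example.

```r
# Illustrative event labels, chosen only to show the matching behavior.
labels <- c("THUNDERSTORM WINDS/HAIL", "DRY MICROBURST", "HEAVY SNOW")

# A reduced, ordered set of the patterns used above (order = precedence).
patterns <- c(
  "Thunderstorm Wind" = "wind|tstm|thunderstorm",
  "Winter Storm"      = "snow|blizzard|ice",
  "Hail"              = "hail",
  "Drought"           = "drought|dry"
)

# Return the name of the first pattern that matches, or "Other" if none does.
first_match <- function(label) {
  hits <- vapply(patterns, grepl, logical(1), x = label, ignore.case = TRUE)
  if (any(hits)) names(patterns)[which(hits)[1]] else "Other"
}

groups <- vapply(labels, first_match, character(1), USE.NAMES = FALSE)
print(data.frame(label = labels, group = groups))
```

In this sketch the combined wind/hail label is counted only as wind, and the microburst label (a wind event) lands in the drought group because of the word "dry". An explicit lookup table for ambiguous labels is one possible way to tighten the grouping.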

storm_data %>%
  count(event_type_clean, sort = TRUE)
##     event_type_clean      n
## 1  Thunderstorm Wind 380765
## 2               Hail 289274
## 3              Flood  86122
## 4            Tornado  71526
## 5       Winter Storm  44877
## 6         Heavy Rain  11845
## 7              Other  11493
## 8            Drought   2785
## 9     Excessive Heat   2666
## 10         Landslide    646
## 11 Hurricane/Typhoon    298

As you can see, we created 10 different groups (plus a catch-all "Other" group), each related to a major weather event, by merging and then renaming similar values in the EVTYPE column. Next, we create a new dataset that summarizes the measures most important to our analysis (fatalities, injuries, crop damage, and property damage) by event type and decade, using the begin date of each event. We include only events from 1990-2011 because most weather event types were not recorded until the 1990s. Event types that have been recorded since the 1950s, such as tornadoes, would otherwise be over-represented and skew the analysis.

storm_data_by_decade <- storm_data %>%
  mutate(
    BGN_DATE = mdy_hms(BGN_DATE),
    decade = case_when(
      year(BGN_DATE) >= 1990 & year(BGN_DATE) < 2000 ~ "1990s",
      year(BGN_DATE) >= 2000 & year(BGN_DATE) <= 2011 ~ "2000s"
    )
  ) %>%
  group_by(event_type_clean, decade) %>%
  summarize(
    FATALITIES = sum(FATALITIES, na.rm = TRUE),
    INJURIES = sum(INJURIES, na.rm = TRUE),
    CROPDMG = sum(CROPDMG, na.rm = TRUE),
    PROPDMG = sum(PROPDMG, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(event_type_clean != "Other") %>%
  arrange(desc(FATALITIES + INJURIES)) %>%
  mutate(
    CROPDMG = dollar(CROPDMG),
    PROPDMG = dollar(PROPDMG)
  )

# Remove rows outside 1990-2011, which have no decade label
storm_data_by_decade <- drop_na(storm_data_by_decade, decade)

print(storm_data_by_decade)
## # A tibble: 20 × 6
##    event_type_clean  decade FATALITIES INJURIES CROPDMG  PROPDMG   
##    <chr>             <chr>       <dbl>    <dbl> <chr>    <chr>     
##  1 Tornado           2000s        1195    15214 $73,635  $912,500  
##  2 Tornado           1990s         563    11553 $26,392  $689,482  
##  3 Thunderstorm Wind 2000s        1219     7281 $137,369 $2,390,820
##  4 Thunderstorm Wind 1990s         873     7553 $90,225  $1,349,686
##  5 Flood             1990s         671     7526 $113,814 $773,990  
##  6 Excessive Heat    1990s        1894     4294 $802     $1,805    
##  7 Excessive Heat    2000s        1244     4930 $671     $1,428    
##  8 Winter Storm      1990s         577     4899 $12,621  $161,968  
##  9 Flood             2000s         882     1157 $253,323 $1,687,671
## 10 Winter Storm      2000s         289     1378 $9,830   $254,852  
## 11 Hurricane/Typhoon 2000s          68     1291 $6,226   $11,926   
## 12 Hail              1990s          10      604 $217,435 $236,703  
## 13 Hail              2000s           5      545 $363,984 $452,607  
## 14 Heavy Rain        2000s          37      167 $9,534   $39,576   
## 15 Heavy Rain        1990s          63      113 $2,914   $13,932   
## 16 Hurricane/Typhoon 1990s          65       42 $5,401   $13,260   
## 17 Landslide         2000s          37       52 $37      $19,270   
## 18 Drought           1990s          30        9 $7,048   $1,396    
## 19 Drought           2000s           2       23 $26,866  $4,441    
## 20 Landslide         1990s           7        3 $0       $1,343
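Note that the pipeline above never filters on year directly: events before 1990 receive a decade of NA from case_when() and are then removed by drop_na(). A more explicit alternative is to filter on the year before aggregating. The following is a minimal base R sketch of that idea, assuming date strings in the month/day/year layout used by BGN_DATE and a hypothetical three-row data frame named toy.

```r
# Hypothetical rows; the date strings mimic the "m/d/Y H:M:S" layout of BGN_DATE.
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "6/1/1995 0:00:00", "8/29/2005 0:00:00"),
  FATALITIES = c(1, 2, 3)
)

# Parse the date part and extract the year as an integer.
parsed   <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")
toy$year <- as.integer(format(parsed, "%Y"))

# Keep only the study period, then label the decade as in the report.
recent <- toy[toy$year >= 1990 & toy$year <= 2011, ]
recent$decade <- ifelse(recent$year < 2000, "1990s", "2000s")

print(recent)
```

Filtering first makes the study period visible in the code and avoids carrying NA groups through the aggregation.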

3 Results

3.1 Events Most Harmful to Population Health

The goal of this section is to identify the weather events most harmful to population health. To answer this question, we refine the dataset by aggregating by event type and dropping the decade column. Then, we filter out droughts and landslides because, while their economic impact is clear, their counts of injuries and fatalities are very low. We’re left with a clean dataset of the eight weather events most harmful to population health from 1990 to 2011:

storm_data_aggregated <- storm_data_by_decade %>%
  group_by(event_type_clean) %>%
  summarize(
    FATALITIES = sum(FATALITIES, na.rm = TRUE),
    INJURIES = sum(INJURIES, na.rm = TRUE),
    CROPDMG = sum(as.numeric(gsub("[\\$,]", "", CROPDMG)), na.rm = TRUE),
    PROPDMG = sum(as.numeric(gsub("[\\$,]", "", PROPDMG)), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!event_type_clean %in% c("Drought", "Landslide")) %>%
  select(-CROPDMG, -PROPDMG)

print(storm_data_aggregated)
## # A tibble: 8 × 3
##   event_type_clean  FATALITIES INJURIES
##   <chr>                  <dbl>    <dbl>
## 1 Excessive Heat          3138     9224
## 2 Flood                   1553     8683
## 3 Hail                      15     1149
## 4 Heavy Rain               100      280
## 5 Hurricane/Typhoon        133     1333
## 6 Thunderstorm Wind       2092    14834
## 7 Tornado                 1758    26767
## 8 Winter Storm             866     6277

Next, let’s observe the impact of these eight weather events on human health:

storm_data_long <- storm_data_aggregated %>%
  pivot_longer(cols = c(FATALITIES, INJURIES), names_to = "Impact_Type", values_to = "Count")

ggplot(data = storm_data_long, aes(x = Count, y = reorder(event_type_clean, -Count), fill = Impact_Type)) +
  geom_segment(aes(xend = 0, yend = event_type_clean), color = "grey37", linewidth = 0.8) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 5, shape = 21, color = "black") +
  scale_fill_manual(values = c("FATALITIES" = "wheat1", "INJURIES" = "slategray1")) +
  scale_x_continuous(
    breaks = seq(0, 27500, by = 2500),
    limits = c(0, 27500),
    expand = c(0, 0)
  ) +
  coord_cartesian(clip = "off") +
  labs(
    title = "Impacts of Weather Event Types on Human Health",
    subtitle = "Comparison of Fatalities and Injuries by Event Type, 1990-2011",
    x = "Count",
    y = "Weather Event Type",
    fill = "Impact Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10, margin = margin(r = 11)),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "top",
    plot.margin = margin(10, 20, 10, 20)
  )

kable(storm_data_aggregated, caption = "Top 8 Weather Events Impacting Population Health, 1990-2011") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top 8 Weather Events Impacting Population Health, 1990-2011

event_type_clean   FATALITIES  INJURIES
Excessive Heat           3138      9224
Flood                    1553      8683
Hail                       15      1149
Heavy Rain                100       280
Hurricane/Typhoon         133      1333
Thunderstorm Wind        2092     14834
Tornado                  1758     26767
Winter Storm              866      6277

As the chart and the corresponding table show, tornadoes cause by far the most injuries (26,767), nearly twice as many as thunderstorm wind, while excessive heat causes the most fatalities (3,138), followed by thunderstorm wind (2,092) and tornadoes (1,758). It’s also worth restating the importance of having merged similar EVTYPE values earlier. Had we skipped preprocessing and run the calculations on the raw dataset, thunderstorm wind would have ranked significantly lower because its counts would have been split across many inconsistently named event types. Likewise, including data from before 1990 would have skewed the analysis toward tornadoes, because most other weather event types were not recorded in the NOAA storm database until the 1990s.
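Because fatalities and injuries do not rank the events in the same order, it can help to sort by each measure separately. Here is a small base R sketch using the aggregated totals from the table above. The data frame name health is an assumption for illustration.

```r
# Totals copied from the aggregated table above (1990-2011).
health <- data.frame(
  event = c("Excessive Heat", "Flood", "Hail", "Heavy Rain",
            "Hurricane/Typhoon", "Thunderstorm Wind", "Tornado", "Winter Storm"),
  fatalities = c(3138, 1553, 15, 100, 133, 2092, 1758, 866),
  injuries   = c(9224, 8683, 1149, 280, 1333, 14834, 26767, 6277)
)

# Event names ordered from most to least harmful on each measure.
by_fatalities <- health$event[order(health$fatalities, decreasing = TRUE)]
by_injuries   <- health$event[order(health$injuries, decreasing = TRUE)]

print(head(by_fatalities, 3))
print(head(by_injuries, 3))
```

Sorting each column on its own avoids letting the much larger injury counts dominate a combined ranking.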

3.2 Weather Events with the Greatest Economic Consequences

Let’s move on to the next question in our analysis: which weather events have the greatest economic consequences? To analyze the economic impacts of weather events, we return to storm_data_by_decade, which includes property and crop damage, so that we can highlight the types of losses (infrastructure vs. agriculture) associated with each event.

economic_data <- storm_data_by_decade %>%
  select(-FATALITIES, -INJURIES) %>%
  mutate(
    CROPDMG = as.numeric(gsub("[\\$,]", "", CROPDMG)),
    PROPDMG = as.numeric(gsub("[\\$,]", "", PROPDMG))
  ) %>%
  group_by(event_type_clean) %>%
  summarize(
    CROPDMG = sum(CROPDMG, na.rm = TRUE),
    PROPDMG = sum(PROPDMG, na.rm = TRUE),
    .groups = "drop"
  )

print(economic_data)
## # A tibble: 10 × 3
##    event_type_clean  CROPDMG PROPDMG
##    <chr>               <dbl>   <dbl>
##  1 Drought             33914    5837
##  2 Excessive Heat       1473    3233
##  3 Flood              367137 2461661
##  4 Hail               581419  689310
##  5 Heavy Rain          12448   53508
##  6 Hurricane/Typhoon   11627   25186
##  7 Landslide              37   20613
##  8 Thunderstorm Wind  227594 3740506
##  9 Tornado            100027 1601982
## 10 Winter Storm        22451  416820

After refining our data into a new dataset named economic_data, as seen above, we now have a table of weather event types and their corresponding property and crop damage from 1990 to 2011. Now let’s plot it:

economic_data_prop <- economic_data %>%
  mutate(Total_Damage = CROPDMG + PROPDMG) %>%
  pivot_longer(cols = c(CROPDMG, PROPDMG), names_to = "Damage_Type", values_to = "Amount")

ggplot(economic_data_prop, aes(x = reorder(event_type_clean, -Total_Damage), y = Amount, fill = Damage_Type)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("CROPDMG" = "wheat1", "PROPDMG" = "slategrey"), name = "Damage Type") +
  labs(
    title = "Proportion of Crop vs Property Damage by Weather Event Type",
    subtitle = "Percentage of Total Damage, 1990-2011",
    x = "Weather Event Type",
    y = "Proportion of Damage",
    fill = "Damage Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    legend.position = "top"
  )

The chart above shows how each event type’s damage splits between crops and property, and the table below gives the totals. Thunderstorm wind causes the most property damage, while hail contributes the most to crop damage. Floods rank second in both categories, making them a very significant economic burden.

economic_data %>%
  mutate(
    CROPDMG = scales::comma(CROPDMG),
    PROPDMG = scales::comma(PROPDMG)
  ) %>%
  kable(
    caption = "Economic Impact by Event Type: Crop vs Property Damage",
    col.names = c("Event Type", "Crop Damage (USD)", "Property Damage (USD)"),
    align = c("l", "r", "r"),
    format = "html"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  ) %>%
  column_spec(1, bold = TRUE)
Economic Impact by Event Type: Crop vs Property Damage

Event Type         Crop Damage (USD)  Property Damage (USD)
Drought                       33,914                  5,837
Excessive Heat                 1,473                  3,233
Flood                        367,137              2,461,661
Hail                         581,419                689,310
Heavy Rain                    12,448                 53,508
Hurricane/Typhoon             11,627                 25,186
Landslide                         37                 20,613
Thunderstorm Wind            227,594              3,740,506
Tornado                      100,027              1,601,982
Winter Storm                  22,451                416,820
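To express these totals as shares of all recorded damage, each column can be divided by its sum. The base R sketch below uses the values from the table above, which are sums of the PROPDMG and CROPDMG columns as recorded. The data frame name econ is an assumption for illustration.

```r
# Totals copied from the economic table above (1990-2011).
econ <- data.frame(
  event = c("Drought", "Excessive Heat", "Flood", "Hail", "Heavy Rain",
            "Hurricane/Typhoon", "Landslide", "Thunderstorm Wind",
            "Tornado", "Winter Storm"),
  crop = c(33914, 1473, 367137, 581419, 12448, 11627, 37, 227594, 100027, 22451),
  prop = c(5837, 3233, 2461661, 689310, 53508, 25186, 20613, 3740506, 1601982, 416820)
)

# Percentage share of each column total, rounded to one decimal place.
econ$crop_share <- round(100 * econ$crop / sum(econ$crop), 1)
econ$prop_share <- round(100 * econ$prop / sum(econ$prop), 1)

# Show the three largest contributors to property damage.
top_prop <- econ[order(-econ$prop_share), c("event", "crop_share", "prop_share")]
print(head(top_prop, 3))
```

Shares are easier to compare across damage types than raw totals, since crop and property damage differ in scale.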

4 Limitations

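While this analysis identifies key weather events and their impacts, several limitations must be considered. Most importantly, the damage totals sum the PROPDMG and CROPDMG values as recorded, without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP (for example, K for thousands and M for millions). The dollar figures therefore understate actual losses, and the economic ranking could change once those multipliers are applied. The figures are also not adjusted for inflation and may be affected by data collection challenges and reporting bias. In addition, the keyword-based grouping of EVTYPE is approximate, so some events may be assigned to the wrong group. Finally, fatalities and injuries are aggregated without distinguishing direct from indirect causes, a distinction that could provide further insight. Improved data collection and reporting standards would strengthen the reliability of these findings.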
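The totals in this report sum PROPDMG and CROPDMG as recorded. To convert them to dollars, the magnitude codes in PROPDMGEXP and CROPDMGEXP could be applied before summing. What follows is a hedged base R sketch of such a conversion, assuming the codes K (thousands), M (millions), and B (billions) and treating any other code as a multiplier of 1. The helper name to_dollars and the toy values are illustrative assumptions, not part of the original analysis.

```r
# Convert a damage value and its magnitude code into dollars.
# Assumption: K/M/B (any case) mean 1e3/1e6/1e9; every other code is treated as 1.
to_dollars <- function(value, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- multipliers[toupper(exp_code)]
  m[is.na(m)] <- 1
  value * unname(m)
}

# Hypothetical records, not taken from the dataset.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", "")
)
toy$prop_usd <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

The dataset also contains other exponent codes whose interpretation is less clear, so the fallback multiplier of 1 is a simplification that would need checking against the NOAA documentation.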

5 Conclusion

The findings show that tornadoes cause the most injuries, accounting for about 39% of all injuries and 18% of fatalities among the ten analyzed event types. Excessive Heat is the leading cause of death, with 3,138 fatalities (about 32% of the total) and 9,224 injuries, emphasizing its impact on vulnerable populations. Thunderstorm Wind leads in property damage, contributing about 41% of the recorded total, while Hail is the top contributor to crop damage at about 43% of the recorded total. Floods stand out as a concern for both public health and economic stability, ranking fourth in both injuries and fatalities and second in both crop and property damage. This analysis underscores the varied nature of weather-related threats and can aid in preparing for future events.

6 References

  1. National Oceanic and Atmospheric Administration (NOAA). NOAA Storm Events Database.
    URL: https://www.ncdc.noaa.gov/stormevents/

  2. R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/

  3. Packages used in this analysis:

    • dplyr: Hadley Wickham, Romain François, Lionel Henry, and Kirill Müller (2023). dplyr: A Grammar of Data Manipulation. URL: https://dplyr.tidyverse.org/
    • ggplot2: Hadley Wickham, Winston Chang, and others (2023). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. URL: https://ggplot2.tidyverse.org/
    • viridis: Simon Garnier (2023). viridis: Default Color Maps from ‘matplotlib’. URL: https://sjmgarnier.github.io/viridis/
    • kableExtra: Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. URL: https://haozhu233.github.io/kableExtra/
    • data.table: Matt Dowle and Arun Srinivasan (2023). data.table: Extension of ‘data.frame’. URL: https://rdatatable.gitlab.io/data.table/
    • scales: Wickham, H., & Seidel, D. (2023). scales: Scale Functions for Visualization. R package version 1.2.1. Available at: https://CRAN.R-project.org/package=scales
    • tidyverse: Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Paige Seidel, A., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. DOI: 10.21105/joss.01686
    • tidyr: Wickham, H. and Girlich, M. (2023). tidyr: Tidy Messy Data. R package version 1.3.0. Available at: https://tidyr.tidyverse.org/
    • tidytext: Julia Silge and David Robinson (2016). tidytext: Text Mining and Analysis Using Tidy Data Principles in R. Journal of Open Source Software, 1(3), 37. DOI: 10.21105/joss.00037