This study uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, with data from 1950 to 2011, to identify the most harmful weather events in the US. It considers the fatalities, injuries, property damage and crop damage resulting from weather events. Note that, although the database covers 1950 to 2011, only the events from 2000 onward were considered.

Setup

Data

The dataset used in this study can be downloaded, in a compressed format: Storm Data

Some documentation is also available:

Packages Used

library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(readr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(lubridate, warn.conflicts = FALSE, quietly = TRUE)

Data Processing

Reading data

The only columns loaded from the dataset were:

  • BGN_DATE: To filter out the older entries.
  • EVTYPE: The type of event.
  • FATALITIES: The number of fatalities caused by the event.
  • INJURIES: The number of injuries caused by the event.
  • PROPDMG, PROPDMGEXP: Columns used to determine the economic damage on properties.
  • CROPDMG, CROPDMGEXP: Columns used to determine the economic damage on crops.

storm_data <- readr::read_csv("data/StormData.csv.bz2",
                              col_types = readr::cols(
                                .default = readr::col_skip(),
                                BGN_DATE = readr::col_date(format= "%m/%d/%Y %T"),
                                EVTYPE = readr::col_factor(),
                                FATALITIES = readr::col_number(),
                                INJURIES = readr::col_number(),
                                PROPDMG = readr::col_double(),
                                PROPDMGEXP = readr::col_character(),
                                CROPDMG = readr::col_double(),
                                CROPDMGEXP = readr::col_character()
                              ))
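
The format string used for BGN_DATE implies that the raw values carry a time component after the date. As a minimal sketch in base R, assuming raw values shaped like "4/18/1950 0:00:00" (illustrative strings, not copied from the file), the date and the year can be recovered as follows:

```r
# Illustrative raw strings, assumed to mimic the BGN_DATE layout
# (month/day/year followed by a time).
raw_dates <- c("4/18/1950 0:00:00", "12/31/2011 0:00:00")

# %H:%M:%S is the expanded form of the %T used in the readr call.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y %H:%M:%S")

# The year is the only part the later filtering step relies on.
years <- as.integer(format(parsed, "%Y"))
print(parsed)
print(years)
```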

Removing the Older Data

The earlier years of the dataset contain fewer records than the more recent ones. In this study, only the records from 2000 onward were considered. This shortens the time period under analysis and keeps the most recent records. It also makes the analysis reflect more accurately the effect of the events under current technology and climate conditions.

recent_storms <- dplyr::filter(storm_data, lubridate::year(BGN_DATE) >= 2000)

The new dataset has only 42% of the rows of the original dataset. Still, that leaves 523163 recent records.
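
The row count and the share of rows kept can be derived directly from the filter condition. The sketch below uses a made-up vector of years as a stand-in for the real data; with the actual data frames, the same quantities would come from nrow(recent_storms) and nrow(recent_storms) / nrow(storm_data).

```r
# Toy stand-in for the event years in storm_data (made-up values).
toy_years <- c(1950, 1975, 1999, 2000, 2005, 2011)

keep <- toy_years >= 2000   # same cutoff as the filter above
kept_rows <- sum(keep)      # number of rows retained
kept_share <- mean(keep)    # fraction of rows retained

cat(kept_rows, "rows kept,", round(100 * kept_share), "% of the total\n")
```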

ggplot(storm_data, aes( x = lubridate::year(BGN_DATE),
                        fill = lubridate::year(BGN_DATE) >= 2000)) + 
  geom_bar(position = position_nudge(x = 0.5)) +
  scale_fill_manual(labels = c("Removing", "Keeping"), values = c("#971911", "#024B96")) +
  labs(x = "Year", y="Events in Year", fill = "Number of Events By Year") +
  theme_bw() + 
  ggtitle("Number of Events By Year") + 
  geom_vline(xintercept = 2000, color = "#04274A", linetype="dashed") +
  scale_x_continuous(breaks=seq(1950,2011, 5)) +
  theme(axis.text.x = element_text(angle=45, vjust=0.5))

Aggregating

In this study, health damage is defined as a combination of the number of injuries and the number of fatalities. Each fatality is weighted 1000 times more than each injury.
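
As a worked example of this weighting, the sketch below wraps the formula in a small function. The name health_score and the event counts are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: weighted health damage for one or more events.
health_score <- function(fatalities, injuries, fatality_weight = 1000) {
  fatality_weight * fatalities + injuries
}

# Two made-up events.
health_score(fatalities = 2, injuries = 30)  # 2030
health_score(fatalities = 3, injuries = 0)   # 3000
```

With this weight, a single fatality counts for more than 999 injuries, so the ranking is driven mainly by fatalities.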

The economic damage is calculated from the property and crop damage, with equal weights for both. Note that each damage value is represented by the number in CROPDMG/PROPDMG and the letter in CROPDMGEXP/PROPDMGEXP, where:

  • K = thousands of dollars.
  • M = millions.
  • B = billions.

costs_of_storms <- dplyr::group_by(recent_storms, EVTYPE) %>%
                   dplyr::summarise(health_damage = sum(1000*FATALITIES + INJURIES),
                                    # Convert crop damage to dollars using the exponent letter
                                    corrected_cropdmg = sum(dplyr::case_when(
                                      CROPDMGEXP == "K" ~ 1e3*CROPDMG,
                                      CROPDMGEXP == "M" ~ 1e6*CROPDMG,
                                      CROPDMGEXP == "B" ~ 1e9*CROPDMG,
                                      TRUE ~ CROPDMG
                                      )),
                                    # Same conversion for property damage
                                    corrected_propdmg = sum(dplyr::case_when(
                                      PROPDMGEXP == "K" ~ 1e3*PROPDMG,
                                      PROPDMGEXP == "M" ~ 1e6*PROPDMG,
                                      PROPDMGEXP == "B" ~ 1e9*PROPDMG,
                                      TRUE ~ PROPDMG
                                      )),
                                    economic_damage = sum(corrected_cropdmg + corrected_propdmg))
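
To make the K/M/B convention concrete, here is a minimal base R sketch that maps each exponent letter to its dollar multiplier. The function name exp_multiplier and the sample values are illustrative assumptions. Treating any other code, including a missing one, as a multiplier of 1 mirrors the fallback branch of the aggregation.

```r
# Hypothetical helper: dollar multiplier for each exponent letter.
# Codes other than K, M or B (including NA) fall back to 1.
exp_multiplier <- function(exp_code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[exp_code])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up damage figures and exponent codes.
dmg <- c(25, 1.5, 2, 10)
exp_code <- c("K", "M", "B", NA)

dollars <- dmg * exp_multiplier(exp_code)
print(dollars)
```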

Results

Based on costs_of_storms, it is possible to plot the five event types most harmful to health from 2000 to 2011:

ggplot(costs_of_storms %>% arrange(desc(health_damage)) %>% head(5), aes(x = reorder(EVTYPE, -health_damage), y = health_damage)) + geom_col() +
  labs(y = "Health Damages", x = "Event Type", title = "Health Harmful Events")

For economic damage in the same period, the Tornado and Flash Flood events also appear among the five most economically harmful event types.

ggplot(costs_of_storms %>% arrange(desc(economic_damage)) %>% head(5), aes(x = reorder(EVTYPE, -economic_damage), y = economic_damage)) + geom_col() +
  labs(y = "Economic Damages", x = "Event Type", title = "Economically Harmful Events")
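
Both plots repeat the same "sort and keep the top five" step. One way to factor it out is sketched below in base R, on made-up totals. The helper name top_n_by is hypothetical, and the event labels and numbers are not results from the storm database.

```r
# Hypothetical helper: the n rows with the largest value in `column`.
top_n_by <- function(df, column, n = 5) {
  ordered <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ordered, n)
}

# Made-up totals for four placeholder event types.
toy_costs <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  health_damage = c(10, 400, 35, 120),
  economic_damage = c(5e6, 1e3, 7e8, 2e4)
)

top_health <- top_n_by(toy_costs, "health_damage", n = 2)
top_economic <- top_n_by(toy_costs, "economic_damage", n = 2)
print(top_health$EVTYPE)
print(top_economic$EVTYPE)
```

The result of each call could then be passed to ggplot() in place of the arrange/head pipeline.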

Conclusions

Although this analysis is very simple, it identified Tornado and Flash Flood as the most harmful events in the US between 2000 and 2011, causing great damage to both health and the economy.

A more thorough analysis could have assessed how the distribution of damages varies over the years, or across the different US states, providing a deeper understanding of the effects.
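
As a rough sketch of what such a breakdown might look like, the base R example below totals damage per event type and year on made-up rows. The column names year and damage are illustrative stand-ins. With the real data, the year would come from BGN_DATE and the damage from the corrected values.

```r
# Made-up events; the numbers are not taken from the storm database.
toy_events <- data.frame(
  year   = c(2000, 2000, 2001, 2001, 2001),
  EVTYPE = c("TORNADO", "FLASH FLOOD", "TORNADO", "TORNADO", "FLASH FLOOD"),
  damage = c(100, 50, 20, 30, 10)
)

# Total damage per event type and year.
by_year <- aggregate(damage ~ year + EVTYPE, data = toy_events, FUN = sum)
print(by_year)
```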