Investigating the severity of hazards in the NOAA Storm Database

Synopsis

In this study I investigate the consequences of hazardous weather for the economy and public health of the U.S. To do so, I analyzed the NOAA Storm Database for the most severe event types in terms of injuries, fatalities and financial damage. For the public health analysis I first looked at the 10 event types that caused the most injuries and the most fatalities, and then at the per-event distributions for the top event types in each category. For the economic consequences I first calculated the total financial damage for every event and then proceeded in the same way as for the injury and fatality numbers.

Data Processing

First I loaded the libraries used in this analysis:

library(tidyverse)
library(ggplot2)
library(data.table)

I downloaded the database on 02.07.2022 from the URL given in the code below.

#loading data
if(!file.exists("./FStormData.csv.bz2")){
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "./FStormData.csv.bz2")
}

I read the downloaded file with the tidyverse's read_csv() and converted the resulting tibble to a data.table object for faster processing.

stormdata = as.data.table(read_csv("./FStormData.csv.bz2"))

I processed the table by selecting only the columns needed for my analysis. Then, from the columns PROPDMG and PROPDMGEXP as well as the corresponding columns for crops (CROPDMG and CROPDMGEXP), I calculated the actual value of the financial damage for the exponents “K”, “M” and “B” (thousands, millions and billions). Values with other exponents (like “+”, “?”, “h” etc.) were set to NA, since I cannot interpret these exponents and they are not explained in the Storm Data Documentation. Then I added the new column DAMAGE, which is the total financial damage (sum of property and crop damage). Because the sum of a number and NA is NA in R, DAMAGE is only available for events where both components could be interpreted. Lastly I removed the columns describing financial damage that were no longer needed.

#processing data
eventdata <- stormdata %>%
    select(EVTYPE, FATALITIES:CROPDMGEXP) %>%
    mutate(prop = case_when(
        PROPDMGEXP == "K" ~ PROPDMG*10^3,
        PROPDMGEXP == "M" ~ PROPDMG*10^6,
        PROPDMGEXP == "B" ~ PROPDMG*10^9,
        #all other abbreviations become NA since I cannot interpret them
    )) %>%
    mutate(crop = case_when(
        CROPDMGEXP == "K" ~ CROPDMG*10^3,
        CROPDMGEXP == "M" ~ CROPDMG*10^6,
        CROPDMGEXP == "B" ~ CROPDMG*10^9,
        #all other abbreviations become NA since I cannot interpret them
    )) %>%
    mutate(DAMAGE = prop+crop) %>%
    select(EVTYPE:INJURIES,DAMAGE)
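
The effect of adding prop and crop when one of them is NA can be seen on a small example. The following is a minimal sketch on made-up rows (not NOAA data); the helper to_dollars() and the column DAMAGE_ALT are hypothetical names introduced only for illustration. It contrasts the sum used above with an alternative that counts a missing component as 0 via coalesce() from dplyr. Switching to that alternative would change the totals reported under Results, so it is shown only as an option.

```r
library(dplyr)

# Toy rows (hypothetical values, not taken from the NOAA database)
toy <- tibble(
    PROPDMG    = c(25, 2.5, 10),
    PROPDMGEXP = c("K", "M", "?"),
    CROPDMG    = c(0, 1, 5),
    CROPDMGEXP = c(NA, "K", "K")
)

# Same multiplier logic as above, wrapped in a helper (hypothetical name)
to_dollars <- function(value, exponent) {
    case_when(
        exponent == "K" ~ value * 10^3,
        exponent == "M" ~ value * 10^6,
        exponent == "B" ~ value * 10^9
        # no match (including a missing exponent) yields NA
    )
}

toy_damage <- toy %>%
    mutate(
        prop       = to_dollars(PROPDMG, PROPDMGEXP),
        crop       = to_dollars(CROPDMG, CROPDMGEXP),
        DAMAGE     = prop + crop,                           # NA if either part is NA
        DAMAGE_ALT = coalesce(prop, 0) + coalesce(crop, 0)  # missing part counted as 0
    )

print(toy_damage)
```

In this toy table only the second row has a non-missing DAMAGE, while DAMAGE_ALT keeps the interpretable part of the first and third rows.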

Results

Public Health analysis

I looked at injuries and fatalities separately, since I do not think it is reasonable to weigh injuries against fatalities in a combined metric for public health consequences.

First I take a look at the top 10 event types for most injuries and most fatalities. To achieve this I grouped the data by EVTYPE and summed up all injuries and fatalities.

#public health analysis
phdata <- eventdata %>% 
    group_by(EVTYPE) %>% 
    summarise(TOTAL_INJURIES = sum(INJURIES, na.rm = TRUE),
              TOTAL_FATALITIES = sum(FATALITIES, na.rm = TRUE))

Here are the top 10 events for most injuries:

arrange(phdata, desc(TOTAL_INJURIES))[1:10, c("EVTYPE", "TOTAL_INJURIES")]

## # A tibble: 10 × 2
##    EVTYPE            TOTAL_INJURIES
##    <chr>                      <dbl>
##  1 TORNADO                    91346
##  2 TSTM WIND                   6957
##  3 FLOOD                       6789
##  4 EXCESSIVE HEAT              6525
##  5 LIGHTNING                   5230
##  6 HEAT                        2100
##  7 ICE STORM                   1975
##  8 FLASH FLOOD                 1777
##  9 THUNDERSTORM WIND           1488
## 10 HAIL                        1361

We can see that tornadoes cause by far the most injuries, and the event type ranked 2nd (TSTM WIND) is also wind/storm related. Floods and flash floods, which describe similar events, are both in the top 10 and together (8,566 injuries) would rank 2nd overall, but since I am not an expert on meteorology and don’t know the exact difference I treat them separately. After the first 5 event types there is quite a big gap (from 5,230 to 2,100 injuries), so I decided to look at the distributions of the first 5 event types only. I also only included events that actually caused injuries, since the many events without injuries, for example in non-populated areas, would drag these distributions down a lot.

topinjuryevents = arrange(phdata, desc(TOTAL_INJURIES))[1:5, c("EVTYPE")][[1]]
ggplot(
    data = filter(eventdata, EVTYPE %in% topinjuryevents & INJURIES > 0),
    aes(x = EVTYPE, y = INJURIES)
) + geom_boxplot() + scale_y_log10()

We can see that the high number of injuries from tornadoes is mostly caused by high statistical outliers, and that the typical heatwave actually causes more injuries than the typical tornado. I assume this is because heatwaves cover larger areas than a tornado, even though they are probably considered less dangerous intuitively. Floods seem to have a similar typical effect to tornadoes, just with fewer high outliers. Lightning behaves similarly: the median lightning event causes only one injury, but the high outliers produce the high total. The same goes for TSTM WIND.
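
The contrast between totals and typical events that the boxplot suggests can also be put into numbers. The following is a minimal sketch on made-up injury counts (the table toy_events and all its values are assumptions for illustration, not NOAA data). It compares the sum, median and maximum per event type and shows how one type can have the larger total while the other has the larger median.

```r
library(dplyr)

# Toy events (hypothetical values): one extreme tornado, steadier heat events
toy_events <- tibble(
    EVTYPE   = c(rep("TORNADO", 6), rep("EXCESSIVE HEAT", 4)),
    INJURIES = c(1, 1, 2, 2, 3, 500, 10, 15, 20, 25)
)

toy_summary <- toy_events %>%
    filter(INJURIES > 0) %>%          # same restriction as in the boxplot
    group_by(EVTYPE) %>%
    summarise(
        N_EVENTS        = n(),
        TOTAL_INJURIES  = sum(INJURIES),
        MEDIAN_INJURIES = median(INJURIES),
        MAX_INJURIES    = max(INJURIES)
    ) %>%
    arrange(desc(TOTAL_INJURIES))

print(toy_summary)
```

The same pipeline could be applied to filter(eventdata, EVTYPE %in% topinjuryevents) to obtain the corresponding numbers for the real data.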

Now here are the top 10 events for most fatalities:

arrange(phdata, desc(TOTAL_FATALITIES))[1:10, c("EVTYPE", "TOTAL_FATALITIES")]

## # A tibble: 10 × 2
##    EVTYPE         TOTAL_FATALITIES
##    <chr>                     <dbl>
##  1 TORNADO                    5633
##  2 EXCESSIVE HEAT             1903
##  3 FLASH FLOOD                 978
##  4 HEAT                        937
##  5 LIGHTNING                   816
##  6 TSTM WIND                   504
##  7 FLOOD                       470
##  8 RIP CURRENT                 368
##  9 HIGH WIND                   248
## 10 AVALANCHE                   224

Again tornadoes rank highest, even though the gap to second place is smaller here. We can also see that the event types ranked 2nd and 4th both describe heat (just entered differently in the database). Combined (2,840 fatalities) they would still not cause more deaths than tornadoes. Flash floods and floods also describe similar events. Combined (1,448 fatalities) they would still rank 3rd, the place flash floods already hold on their own, so I treat them separately again. I also decided that the gap between ranks 5 and 6 is big enough to look only at the distributions of the 5 highest ranking event types.

topfatalityevents = arrange(phdata, desc(TOTAL_FATALITIES))[1:5, c("EVTYPE")][[1]]
ggplot(
    data = filter(eventdata, EVTYPE %in% topfatalityevents & FATALITIES > 0),
    aes(x = EVTYPE, y = FATALITIES)
) + geom_boxplot()+ scale_y_log10()

What surprises me most here is that none of these event types differ in median fatalities: the median is 1, the lowest value possible since we only looked at events that actually caused fatalities. The differences in total fatalities must therefore come from the upper part of the distributions (and from the number of events). We can see that the 75th percentile is highest for tornadoes and about equal across both heat event types and flash floods. Lightning only had 4 events that caused more than the minimum of 1 fatality. Even though tornadoes in general seem to cause more deaths, the by far most deadly single event shown here was actually a heat event, and I want to point out again that two of the event types here describe heat.
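
To put a number on the two heat categories, the totals from the fatalities table above can be merged under a common label and re-ranked. This is only a sketch: the five totals are copied from that table, while the mapping of EXCESSIVE HEAT and HEAT to a single label "HEAT (ALL)" and the column name EVGROUP are assumptions made for this example, not an official NOAA grouping.

```r
library(dplyr)

# Top 5 totals copied from the fatalities table above
top_fatalities <- tibble(
    EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),
    TOTAL_FATALITIES = c(5633, 1903, 978, 937, 816)
)

merged_fatalities <- top_fatalities %>%
    # Assumed grouping: both heat labels are treated as one event group
    mutate(EVGROUP = if_else(EVTYPE %in% c("EXCESSIVE HEAT", "HEAT"),
                             "HEAT (ALL)", EVTYPE)) %>%
    group_by(EVGROUP) %>%
    summarise(TOTAL_FATALITIES = sum(TOTAL_FATALITIES)) %>%
    arrange(desc(TOTAL_FATALITIES))

print(merged_fatalities)
```

With these numbers the merged heat group totals 2,840 fatalities and stays in second place behind tornadoes.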

Economic analysis

The methodology I used after summing up the total financial damage caused by the events does not differ from what I did in the public health analysis.

Here are the top 10 event types for most financial damage:

#economic analysis
econdata <- eventdata %>% 
    group_by(EVTYPE) %>% 
    summarise(TOTAL_DAMAGE = sum(DAMAGE, na.rm = TRUE))

arrange(econdata, desc(TOTAL_DAMAGE))[1:10, c("EVTYPE", "TOTAL_DAMAGE")]

## # A tibble: 10 × 2
##    EVTYPE            TOTAL_DAMAGE
##    <chr>                    <dbl>
##  1 FLOOD             138007444500
##  2 HURRICANE/TYPHOON  29348167800
##  3 TORNADO            16520148150
##  4 HURRICANE          12405268000
##  5 RIVER FLOOD        10108369000
##  6 HAIL               10019978590
##  7 FLASH FLOOD         8715295130
##  8 ICE STORM           5925147300
##  9 STORM SURGE/TIDE    4641493000
## 10 THUNDERSTORM WIND   3813647990

We can see that floods cause by far the highest financial damage of all event types. In addition, 3 of the top 10 event types are flood related (FLOOD, RIVER FLOOD and FLASH FLOOD). Since there are big gaps in total damage between each of the first 4 event types, I decided to look only at the distributions of the 3 highest ranking event types.

topdamageevents = arrange(econdata, desc(TOTAL_DAMAGE))[1:3, c("EVTYPE")][[1]]
ggplot(
    data = filter(eventdata, EVTYPE %in% topdamageevents & DAMAGE > 0),
    aes(x = EVTYPE, y = DAMAGE)
) + geom_boxplot()+ scale_y_log10()

We can see that the typical hurricane/typhoon actually causes more damage than the typical flood or tornado, which are very similar to each other, but for both of the latter there are a lot of high statistical outliers. In fact there apparently was a single flood event that caused by far the most financial damage of all the events we looked at. I did not repeat the analysis without this maximum outlier, since I do not think it would be justified to disregard it, in the hope that another flood like this is improbable, on the basis of this exploratory data analysis alone. I also suspect that floods being much more frequent than hurricanes/typhoons contributes to the high total financial damage caused by floods.
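
How strongly a single extreme event drives a total can be expressed as the share of the total that comes from the maximum. The sketch below does this on made-up damage values (the table toy_damage_events and all its numbers are assumptions for illustration, not NOAA figures) and also reports the event count, which relates to the frequency argument above.

```r
library(dplyr)

# Toy events (hypothetical values): many small floods plus one extreme flood
toy_damage_events <- tibble(
    EVTYPE = c(rep("FLOOD", 5), rep("HURRICANE/TYPHOON", 2)),
    DAMAGE = c(1e6, 2e6, 2e6, 5e6, 9.9e8, 3e8, 5e8)
)

damage_profile <- toy_damage_events %>%
    group_by(EVTYPE) %>%
    summarise(
        N_EVENTS     = n(),
        TOTAL_DAMAGE = sum(DAMAGE),
        MAX_DAMAGE   = max(DAMAGE),
        MAX_SHARE    = MAX_DAMAGE / TOTAL_DAMAGE  # share of total from the largest event
    )

print(damage_profile)
```

Applied to filter(eventdata, EVTYPE %in% topdamageevents & DAMAGE > 0), the same summary would show how much of the flood total stems from the single largest event and how the event counts compare.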

Conclusion

With regard to public health I would, based on this exploratory data analysis, describe tornadoes and heatwaves as the most severe weather hazards. For economic consequences I would claim that floods are most severe, but hurricanes, typhoons and tornadoes cannot be disregarded as economic dangers. In addition I would like to mention that a number of resulting costs are not tracked in the database I accessed: for example, the public health consequences probably also affect the economy indirectly, and the destruction of working facilities likely causes higher economic damage than “just” the financial value that has been destroyed.