Most Damaging Weather: Exploratory Analysis of NWS Storm Data

Synopsis

The NWS data set contains detailed logs of severe weather events in the United States. Aggregating by event type (Tornado, Flash Flood, Heavy Rain, etc.) shows that Tornadoes have caused the most fatalities, more than double the total for the second-ranked event type, Excessive Heat. Ranking events by average fatalities per event instead of by totals puts Heat Waves, Excessive Heat, and Heat in the top 3. One possible explanation is that Heat Waves are generally not named and reported as such unless there is some loss of life; another is that a few notorious heat waves (like the one in Chicago) claimed hundreds of lives and pushed up the mean. In terms of aggregate property damage, Tornadoes are still the worst, but floods and flash floods enter the top 5.

Data Processing

Libraries are loaded in their own code block so that the data loading step can be cached:

library(magrittr)
library(dplyr)
library(R.utils)
library(ggplot2)
library(scales)

Download the bzip2-compressed file from the web, decompress it, and load the CSV.

tdir <- tempdir()
tname <- tempfile(pattern = "storms", tmpdir = tdir, fileext = ".bz2")
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', destfile = tname, mode = "wb", quiet = TRUE)
bunzip2(tname, "storms.csv", remove = FALSE, skip = TRUE)

## [1] "storms.csv"
## attr(,"temporary")
## [1] FALSE

storms <- read.csv('storms.csv')
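
As an aside, base R can usually read a bzip2-compressed CSV directly, because the file connection that read.csv opens detects the compression. That would make the separate bunzip2 step optional. A minimal sketch on a small made-up file (the toy data frame and temporary path are illustrative, not part of the NWS data):

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(2, 0))

# Write the CSV through a bzip2 connection
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path with file(), which detects bzip2 compression
toy_back <- read.csv(path, stringsAsFactors = FALSE)
toy_back
```

Decompressing to storms.csv first, as done above, keeps an uncompressed copy on disk, which can be convenient when the file is read repeatedly.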

Group by event type and calculate aggregate statistics:

storms_by_event <- storms %>%
  group_by(EVTYPE) %>%
  summarize(
    total_propdmg = sum(PROPDMG, na.rm = TRUE) %>% round(1),
    mean_propdmg = mean(PROPDMG, na.rm = TRUE) %>% round(2),
    total_injuries = sum(INJURIES, na.rm = TRUE),
    mean_injuries = mean(INJURIES, na.rm = TRUE) %>% round(2),
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    mean_fatalities = mean(FATALITIES, na.rm = TRUE) %>% round(2),
    n = n()
  )
  
storms_by_event

## Source: local data frame [985 x 8]
## 
##                   EVTYPE total_propdmg mean_propdmg total_injuries
##                   (fctr)         (dbl)        (dbl)          (dbl)
## 1     HIGH SURF ADVISORY           200          200              0
## 2          COASTAL FLOOD             0            0              0
## 3            FLASH FLOOD            50           50              0
## 4              LIGHTNING             0            0              0
## 5              TSTM WIND           108           27              0
## 6        TSTM WIND (G45)             8            8              0
## 7             WATERSPOUT             0            0              0
## 8                   WIND             0            0              0
## 9                      ?             5            5              0
## 10       ABNORMAL WARMTH             0            0              0
## ..                   ...           ...          ...            ...
## Variables not shown: mean_injuries (dbl), total_fatalities (dbl),
##   mean_fatalities (dbl), n (int).

Results

There are a lot of one-off events in our EVTYPE records. To make conclusions about weather trends, we need to put some sort of sensible filter in place.

storms_by_event$n %>% quantile(c(.1, .5, .6, .7, .8, .9, .95))

##   10%   50%   60%   70%   80%   90%   95% 
##   1.0   2.0   2.0   3.8   7.0  35.2 258.8

# Keep only event types that appear more than 35 times
event_counts <- table(storms$EVTYPE)
filtered_events <- names(event_counts[event_counts > 35])
# Share of rows retained after filtering
mean(storms$EVTYPE %in% filtered_events)

## [1] 0.9967416

Only 10% of the event types occur more than 35 times in the database, which seems like a reasonable cutoff for trimming the list. Although this discards 90% of the unique event strings in EVTYPE, it keeps 99.7% of the rows in the data frame, which leaves a much cleaner data set for analysis.
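
Many of the rare EVTYPE strings look like spelling, casing, or abbreviation variants of common ones (for example TSTM WIND, TSTM WIND (G45), and THUNDERSTORM WIND). A complementary approach to filtering would be to normalize the labels before grouping. The sketch below is only an illustration: the helper name normalize_evtype is hypothetical, the TSTM-to-THUNDERSTORM mapping is an assumption about what the abbreviation means, and the analysis in this report does not apply it.

```r
# Hypothetical helper: collapse casing, whitespace, and abbreviation variants
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- gsub("\\s+", " ", x, perl = TRUE)                   # collapse repeated spaces
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)  # assumed abbreviation
  x <- gsub("\\s*\\(.*\\)$", "", x, perl = TRUE)           # drop trailing qualifiers like "(G45)"
  x
}

# Made-up label variants for illustration
labels <- c("TSTM WIND", "Thunderstorm Wind", " THUNDERSTORM  WIND", "TSTM WIND (G45)")
table(normalize_evtype(labels))
```

Applying such a mapping to the real data would change the counts and rankings reported below, so any merge rules would need to be checked against the actual label list.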

Human Health

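# Rank event types by mean fatalities per event (1 = deadliest on average).
# row_number(desc(x)) assigns unique ranks; order() returns a sorting
# permutation and only matches the ranks because the data is already sorted.
highest_avg_deaths <- storms_by_event %>%
  filter(EVTYPE %in% filtered_events) %>%
  arrange(mean_fatalities %>% desc()) %>%
  mutate(
    rank_fatalities = row_number(mean_fatalities %>% desc())
  ) %>%
  as.data.frame()

# Rank event types by total fatalities
most_deaths <- storms_by_event %>%
  filter(EVTYPE %in% filtered_events) %>%
  arrange(total_fatalities %>% desc()) %>%
  mutate(
    rank_fatalities = row_number(total_fatalities %>% desc())
  ) %>%
  as.data.frame()

# Sort event types by mean injuries per event (no rank column is used later)
highest_avg_inj <- storms_by_event %>%
  filter(EVTYPE %in% filtered_events) %>%
  arrange(mean_injuries %>% desc()) %>%
  as.data.frame()

# Rank event types by total injuries
most_inj <- storms_by_event %>%
  filter(EVTYPE %in% filtered_events) %>%
  arrange(total_injuries %>% desc()) %>%
  mutate(
    rank_inj = row_number(total_injuries %>% desc())
  ) %>%
  as.data.frame()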

One way to think about the ‘worst’ weather for human health is to look at the aggregate loss of life by event type across the whole record. This does not account for the fact that some event types occur far more often than others, so it favors high-frequency events. Still, if you were trying to predict the most likely cause of a weather-related death, these aggregate totals give a good sense of what claims the most lives.

most_deaths$y_var <- most_deaths$total_fatalities
most_deaths$x_var <- most_deaths$rank_fatalities
most_deaths$ypos <- 1

p <- ggplot(
  data = most_deaths[1:20, ],
  aes(x = x_var, y = y_var)
) +
geom_bar(stat = 'identity') +
geom_text(
  aes(
    label = EVTYPE,
    y = ypos
  ),
  angle = 90,
  color = 'hotpink',
  hjust = 0
) +
theme_bw() +
theme(panel.grid = element_blank())

p + labs(x = 'rank', y = 'total fatalities')

Fig 1: 20 Most Deadly Weather Events, All-Time

Another way to think about the most dangerous weather events is to examine which events kill the largest number of people, on average, each time they occur. Because we are looking at averages, this puts high- and low-frequency events on an equal footing. Note that the filter described above still applies: only event types reported more than 35 times qualify for analysis.

highest_avg_deaths$y_var <- highest_avg_deaths$mean_fatalities
highest_avg_deaths$x_var <- highest_avg_deaths$rank_fatalities
highest_avg_deaths$ypos <- 0

p %+% highest_avg_deaths[1:20,] + labs(x = 'rank', y = 'avg fatalities')

Fig 2: 20 Highest Average Mortality Weather Events, All-Time
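
The difference between the two rankings (Fig 1 by totals, Fig 2 by averages) can be illustrated on a toy event log. The numbers below are made up purely for illustration, and the sketch uses base R's tapply rather than the dplyr pipeline used in the report:

```r
# Toy event log (made-up numbers): a frequent, rarely fatal event type
# and a rare, often fatal one
toy <- data.frame(
  EVTYPE = c(rep("FREQUENT", 100), rep("RARE", 4)),
  FATALITIES = c(rep(c(1, 0, 0, 0), 25), 5, 3, 4, 4)
)

totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum)   # aggregate deaths per type
means  <- tapply(toy$FATALITIES, toy$EVTYPE, mean)  # deaths per occurrence

names(which.max(totals))  # type with the most deaths overall
names(which.max(means))   # type with the most deaths per occurrence
```

In this toy log the frequent event type leads on total deaths while the rare one leads on deaths per occurrence, which mirrors how Tornadoes and heat events trade places between the two figures.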

Property Damage

# Rank event types by total property damage (sum of the PROPDMG column)
most_damage <- storms_by_event %>%
  filter(EVTYPE %in% filtered_events) %>%
  arrange(total_propdmg %>% desc()) %>%
  mutate(
    rank_propdmg = row_number(total_propdmg %>% desc())
  ) %>%
  as.data.frame()

most_damage %>% head()

##              EVTYPE total_propdmg mean_propdmg total_injuries
## 1           TORNADO     3212258.2        52.96          91346
## 2       FLASH FLOOD     1420124.6        26.16           1777
## 3         TSTM WIND     1335965.6         6.07           6957
## 4             FLOOD      899938.5        35.53           6789
## 5 THUNDERSTORM WIND      876844.2        10.62           1488
## 6              HAIL      688693.4         2.39           1361
##   mean_injuries total_fatalities mean_fatalities      n rank_propdmg
## 1          1.51             5633            0.09  60652            1
## 2          0.03              978            0.02  54277            2
## 3          0.03              504            0.00 219940            3
## 4          0.27              470            0.02  25326            4
## 5          0.02              133            0.00  82563            5
## 6          0.00               15            0.00 288661            6

Tornadoes have caused the largest aggregate property damage in the US, followed by flash floods in second place and floods in fourth. Thunderstorm winds, recorded under two separate labels (TSTM WIND and THUNDERSTORM WIND), come in third and fifth.

most_damage$y_var <- most_damage$total_propdmg
most_damage$x_var <- most_damage$rank_propdmg
most_damage$ypos <- 0

# y_var holds the summed PROPDMG column, so the axis shows a total, not an average
p %+% most_damage[1:20,] + labs(x = 'rank', y = 'total property damage (sum of PROPDMG)') +
  scale_y_continuous(labels = comma)

Fig 3: 20 Most Economically Damaging Events, All-Time
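
The property damage totals above sum the raw PROPDMG column. In this data set PROPDMG is usually paired with a PROPDMGEXP column that carries a magnitude code (commonly K, M, or B for thousands, millions, and billions), so a dollar figure would need that multiplier applied before summing. A minimal sketch, assuming only those three codes and treating anything else as a multiplier of 1. That fallback is an assumption: the real column contains other values whose meaning should be checked against the NWS documentation. The helper name damage_dollars and the toy rows are hypothetical.

```r
# Hypothetical helper: scale PROPDMG by its magnitude code
damage_dollars <- function(dmg, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- multipliers[toupper(as.character(exp_code))]
  m[is.na(m)] <- 1  # assumption: blank or unrecognized codes mean no scaling
  unname(dmg * m)
}

# Made-up rows for illustration
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", "")
)
toy$propdmg_usd <- damage_dollars(toy$PROPDMG, toy$PROPDMGEXP)
toy
```

Summing a column like propdmg_usd by EVTYPE, instead of raw PROPDMG, could reorder the ranking in Fig 3, since a single event coded in billions outweighs many events coded in thousands.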