Most harmful weather events in the US

Synopsis

To answer the two questions posed in the task:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

we summed the corresponding variables in the NOAA Storm Database by event type.

The results show that, over the whole period of observations, tornados were the most dangerous events for human lives, while floods were the most harmful to the economy.

Data Processing

library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)

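First we download the raw data from the given URL, unless the archive is already present:

if (!file.exists('data.csv.bz2')) {
    # mode = 'wb' keeps the binary archive intact on Windows;
    # no extraction step is needed because read_csv reads .bz2 files directly
    download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'data.csv.bz2', mode = 'wb')
}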

We read the data directly from the archive with readr::read_csv:

raw_data <- read_csv('data.csv.bz2', progress = FALSE)

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   STATE__ = col_double(),
##   COUNTY = col_double(),
##   BGN_RANGE = col_double(),
##   COUNTY_END = col_double(),
##   END_RANGE = col_double(),
##   LENGTH = col_double(),
##   WIDTH = col_double(),
##   F = col_integer(),
##   MAG = col_double(),
##   FATALITIES = col_double(),
##   INJURIES = col_double(),
##   PROPDMG = col_double(),
##   CROPDMG = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   LATITUDE_E = col_double(),
##   LONGITUDE_ = col_double(),
##   REFNUM = col_double()
## )

## See spec(...) for full column specifications.

According to the codebook,

[Damage] estimates are rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.

However, we observe more kinds of values in the PROPDMGEXP and CROPDMGEXP columns:

table(c(raw_data$PROPDMGEXP, raw_data$CROPDMGEXP))

## 
##      +      -      0      1      2      3      4      5      6      7 
##      5      1    235     25     14      4      4     28      4      5 
##      8      ?      B      H      K      M      h      k      m 
##      1     15     49      6 706497  13324      1     21      8

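We treat digit values as powers of ten (10 ^ x) and accept both lower- and uppercase K, M and B as multipliers of 10^3, 10^6 and 10^9. Every other value, including a missing modifier, turns into NA during the numeric conversion and is dropped from the sums. As the table above shows, the unrecognised symbols are rare.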

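normalizeValues <- function(values, values_exp) {
    modifier <- values_exp
    # Letter modifiers map to powers of ten; digits are already exponents
    modifier[modifier %in% c('k', 'K')] <- 3
    modifier[modifier %in% c('m', 'M')] <- 6
    modifier[modifier %in% c('b', 'B')] <- 9
    # Anything else ('+', '-', '?', 'h', 'H', missing) becomes NA here
    modifier <- as.numeric(modifier)
    values * (10 ^ modifier)
}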
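As a quick sanity check of the conversion rules, the sketch below reapplies the same function to a few made-up value/modifier pairs (the inputs are illustrative and not taken from the database). The unrecognised '?' modifier yields NA together with a coercion warning, which is silenced here.

```r
# Same function as in the report, repeated so the snippet runs on its own.
normalizeValues <- function(values, values_exp) {
    modifier <- values_exp
    modifier[modifier %in% c('k', 'K')] <- 3
    modifier[modifier %in% c('m', 'M')] <- 6
    modifier[modifier %in% c('b', 'B')] <- 9
    modifier <- as.numeric(modifier)
    values * (10 ^ modifier)
}

# Made-up inputs: 1.55B, 25k, 3 with digit modifier 2, and an invalid '?'.
example <- suppressWarnings(
    normalizeValues(c(1.55, 25, 3, 10), c('B', 'k', '2', '?'))
)
print(example)
```

The first element corresponds to the codebook's own example of 1.55B, and the last one shows how an unsupported modifier removes the value from later sums.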

Now we keep only the columns that characterize damage, give them lowercase names, and normalize the damage cost values as described above. We then sum each variable within each event type and compute the overall totals, which are used later to express each event's share as a percentage.

summarised <- raw_data %>%
    mutate(
        crop_damage = normalizeValues(CROPDMG, CROPDMGEXP),
        prop_damage = normalizeValues(PROPDMG, PROPDMGEXP),
        total_damage = crop_damage + prop_damage
    ) %>%
    select(
        event_type = EVTYPE,
        fatalities = FATALITIES,
        injuries = INJURIES,
        crop_damage,
        prop_damage,
        total_damage
    ) %>%
    group_by(event_type) %>%
    summarise_all(sum, na.rm = TRUE)

## Warning in normalizeValues(CROPDMG, CROPDMGEXP): NAs introduced by coercion

## Warning in normalizeValues(PROPDMG, PROPDMGEXP): NAs introduced by coercion

total_fatalities <- sum(summarised$fatalities)
total_injuries <- sum(summarised$injuries)
total_damage <- sum(summarised$total_damage)
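The coercion warnings above come from modifiers that are still not numbers after the K/M/B substitution. A related detail may be worth checking: in R, adding NA to a number gives NA, so a row where only one of crop_damage or prop_damage is NA contributes nothing to total_damage, even though its valid part still counts in its own column. The toy vectors below are invented to illustrate this; treating a missing component as zero is an assumption, not what the report does.

```r
# Invented values: row 2 lacks crop damage, row 3 lacks property damage.
prop_damage <- c(1000, 5e6, NA)
crop_damage <- c(200, NA, 3000)

# Plain addition, as in the pipeline: NA in either column gives an NA total.
naive_total <- prop_damage + crop_damage

# Alternative under the stated assumption: count a missing component as 0.
safe_total <- rowSums(cbind(prop_damage, crop_damage), na.rm = TRUE)

print(sum(naive_total, na.rm = TRUE))
print(sum(safe_total))
```

If the two sums differ noticeably on the real data, the ranking by total_damage could be compared with a ranking by the sum of the summarised prop_damage and crop_damage columns.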

We choose the top 5 most dangerous events for human lives. Because fatalities and injuries are ethically hard to weigh against each other, we sort by fatalities only.

top5_health_dmg <- summarised %>%
    select(event_type, fatalities, injuries) %>%
    arrange(desc(fatalities)) %>%
    head(5)

For economic damage we can sort by the sum of property and crop damage:

top5_economic_dmg <- summarised %>%
    select(event_type, prop_damage, crop_damage, total_damage) %>%
    arrange(desc(total_damage)) %>%
    head(5)

Results

Tornados are most dangerous for human lives

The five most dangerous events for human lives account for 67% of all fatalities and 76% of all injuries.

Tornados are by far the most dangerous among them, judged either by fatalities or injuries.
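The shares quoted above can be obtained by dividing the top-five sums by the overall totals. The helper below is a hypothetical illustration with invented counts; in the report it would presumably be applied to top5_health_dmg$fatalities and total_fatalities (and likewise for injuries).

```r
# Hypothetical helper: percentage of `whole` covered by the sum of `part`.
share_percent <- function(part, whole) {
    round(100 * sum(part) / whole)
}

# Invented counts, not the database values.
top_counts <- c(50, 20, 10, 5, 5)
all_count <- 120

print(share_percent(top_counts, all_count))
```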

top5_health_dmg_displ <- top5_health_dmg %>%
    select(event_type, Fatal = fatalities, Injury = injuries) %>%
    mutate(
        Fatal = 100. * Fatal / total_fatalities,
        Injury = 100. * Injury / total_injuries
    ) %>%
    gather('Outcome', 'percentage', -event_type) %>%
    mutate(
        event_type = factor(event_type, levels = rev(top5_health_dmg$event_type))
    )

ggplot(aes(y = percentage, x = event_type, fill = Outcome), data = top5_health_dmg_displ) +
    geom_bar(stat = 'identity', position = 'dodge') +
    xlab('Event Type') +
    ylab('% of cases') +
    ylim(0, 100) +
    ggtitle('5 most dangerous weather events for human lives') +
    coord_flip()

Floods are most harmful to the economy

The five most harmful events for the economy account for 78% of all economic damage.

Floods are by far the most harmful of these, causing total damage of 138 billion dollars over the whole period of observations.

It is also worth noting that property damage tends to be larger than crop damage for all of these events.

top5_economic_dmg_displ <- top5_economic_dmg %>%
    select(event_type, Property = prop_damage, Crop = crop_damage) %>%
    gather('object', 'damage', -event_type) %>%
    mutate(
        event_type = factor(event_type, levels = rev(top5_economic_dmg$event_type))
    )

ggplot(aes(y = damage / 1e9, x = event_type, fill = object), data = top5_economic_dmg_displ) +
    geom_bar(stat = 'identity') +
    xlab('Event Type') +
    ylab('Damage caused, billion USD') +
    ggtitle('5 most harmful weather events for the economy') +
    coord_flip()

Follow-up

There are many other aspects that could be considered; we only list them here:

does the result change if we consider only the most recent years?
what is the seasonal tendency?
which locations are most dangerous?
does the result change if we group together similar events (like ‘heat’ and ‘excessive heat’)?
etc.
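For the question about grouping similar events, one possible starting point is to normalise the event labels before aggregating. The helper name and the two matching rules below are hypothetical and deliberately crude; a real analysis would need a carefully reviewed mapping of event types.

```r
# Hypothetical helper: collapse spelling variants of event types.
group_event_type <- function(event_type) {
    # Unify case and whitespace first, so 'Heat' and ' HEAT' match
    cleaned <- toupper(trimws(event_type))
    cleaned <- gsub('\\s+', ' ', cleaned)
    # Illustrative rules only: any label mentioning the keyword is merged
    cleaned[grepl('HEAT', cleaned)] <- 'HEAT'
    cleaned[grepl('TORNADO', cleaned)] <- 'TORNADO'
    cleaned
}

grouped <- group_event_type(c('Heat', 'EXCESSIVE HEAT', ' tornado f1', 'FLOOD'))
print(grouped)
```

In the pipeline above, such a helper could be applied to EVTYPE inside mutate() before group_by(event_type), and the rankings compared with and without the grouping.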