Synopsis

In this report we aim to identify the most harmful natural events recorded in the Storm Data publication, produced by the National Oceanic and Atmospheric Administration (NOAA). Our main goal is to distinguish the most dangerous events both for human health and the economy. We take two different approaches to that problem, by measuring the total and average harm caused by each event type. We have found floods and tornadoes to be the most dangerous types of events to hit the US, causing the most severe consequences both in total and mean measurements.

Data Processing

We begin by loading everything we´ll need for this analysis (i.e., packages and data set) and reading in the data from Storm Data.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(usmap))
if (!file.exists("StormData.csv.bz2")) {
    download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', destfile = "StormData.csv.bz2")}

BaseStormData <- tbl_df(read.csv("StormData.csv.bz2", stringsAsFactors = FALSE))
BaseStormData$BGN_DATE <- date(mdy_hms(BaseStormData$BGN_DATE))

The NOAA’s Database only started to account for all event types in jan 1996. All entries before then was limited to Tornado, Thunderstorm Wind and Hail events, which could confuse and bias our results. Since our goal is to identify the effects of different types of event, we’ll reduce the data set to get rid of the registers older than jan 1996. Reducing the time span still leaves us with over 60,000 observations, which is quite a robust data set. We’ll also drop some of the columns that won’t be useful to this study.

StormData <- filter(BaseStormData, BGN_DATE > ymd(19960101)) %>% 
    select(BGN_DATE, STATE, COUNTYNAME, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Now we can group the observations by event type and check how many distinct events have been recorded.

by_event <- StormData %>%
    group_by(EVTYPE) %>%
    summarize(freq = n())

count_events <- length(by_event$EVTYPE)
print(count_events)
## [1] 515

According to the National Weather Service Instruction 10-1605, there are 48 distinct event types, so we expected to have a maximum of 48 unique records in this column. In stead, we got 515 distinct event types on record, which shows we have some cleaning work to do before we actually start analyzing the data.

Dealing with different descriptions and typos

It took us some time and some decision making to adjust different descriptions. We focused on the most frequent descriptions and dealt with them almost individually to group the higher possible number of events into a unique (and consistent) description. We’re not going to describe the decision making process here, but the rules applied were this:

StormData$EVTYPE <- sub("WIND/HAIL", "WIND", StormData$EVTYPE)
StormData$EVTYPE <- sub("TSTM WIND", "THUNDERSTORM WIND", StormData$EVTYPE)
StormData$EVTYPE <- sub("WINDCHILL", "COLD/WIND CHILL", StormData$EVTYPE)
StormData$EVTYPE <- sub("WINDS", "WIND", StormData$EVTYPE)
StormData$EVTYPE <- sub("^WIND$", "HIGH WIND", StormData$EVTYPE)
StormData$EVTYPE <- sub("EXTREME COLD$", "EXTREME COLD/WIND CHILL", StormData$EVTYPE)
StormData$EVTYPE <- sub("WILD/FOREST FIRE", "WILDFIRE", StormData$EVTYPE)
StormData$EVTYPE <- sub("WEATHER/MIX", "WEATHER", StormData$EVTYPE)
StormData$EVTYPE <- sub("CURRENTS", "CURRENT", StormData$EVTYPE)
StormData$EVTYPE <- sub("HEAVY SURF/HIGH SURF", "HIGH SURF", StormData$EVTYPE)
StormData$EVTYPE <- sub("^HURRICANE$", "HURRICANE/TYPHOON", StormData$EVTYPE)
StormData$EVTYPE <- sub("^FOG", "DENSE FOG", StormData$EVTYPE)
StormData$EVTYPE <- sub("FLOODING", "FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("FLOODIN", "FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("FLDG", "FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("FLD", "FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^RIVER FLOOD$", "FLASH FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^URBAN/SML STREAM FLOOD$", "FLASH FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^URBAN/SMALL STREAM FLOOD$", "FLASH FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^URBAN/SMALL$", "FLASH FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^SMALL STREAM FLOOD$", "FLASH FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^URBAN FLOOD$", "FLASH FLOOD", StormData$EVTYPE)
StormData$EVTYPE <- sub("^STORM SURGE$", "STORM SURGE/TIDE", StormData$EVTYPE)

We’re confident we’ve made good decisions here, because it allowed us to keep 651,079 of the 653,507 lines in the data set from Jan 1996. That’s over 99% of the available registers.

Dealing with PROPDMG and CROPDMG

There’s very a good explanation on how to read the columns PROPDMGEXP and CROPDMGEXP here. We’ve translated those rules into code and adapted the information on property damage (PROPDMG) and crop damage (CROPDM) to show the actual values in US dollars. We’ve realized there’s one serious mistake in the data base that we corrected here: the entry for the 01/01/2006 flood in California reported propriety damage of 115 billion dollars in NAPA county, when the total damage was actually estimated in 115 million dollars.

StormData <- StormData %>% 
    mutate(PROPDMG = ifelse(STATE == "CA" & BGN_DATE == ymd(20060101) & EVTYPE == "FLOOD" & COUNTYNAME == "NAPA", PROPDMG*1000000,
                            ifelse(PROPDMGEXP == "B", PROPDMG*1000000000, 
                                ifelse(PROPDMGEXP == "M", PROPDMG*1000000, 
                                       ifelse(PROPDMGEXP == "K", PROPDMG*1000,
                                              ifelse(PROPDMGEXP == "0", PROPDMG*10, PROPDMG
                                              )))))) %>% 
    mutate(CROPDMG = ifelse(CROPDMGEXP == "B", CROPDMG*1000000000, 
                            ifelse(CROPDMGEXP == "M", CROPDMG*1000000, 
                                   ifelse(CROPDMGEXP == "K", CROPDMG*1000, CROPDMG
                                   )))) %>% 
    select(-PROPDMGEXP, -CROPDMGEXP)

Cutting to the chase

by_event2 <- StormData %>%
    group_by(EVTYPE) %>%
    summarize(freq = n())

count_events2 <- length(by_event2$EVTYPE)
count_lines <- sum(by_event2$freq)
print(count_events2)
## [1] 487
print(count_lines)
## [1] 653507

All the substitutions on the EVTYPE column entries left us with 487 distinct event types, which is still quite a lot. We’ve decided to leave out of our study events that had, alone, less than 1% of the occurrences in the data set. We rounded that threshold to 6,500 occurrences.

by_event2 <- by_event2 %>%
    filter(freq >= 6500) %>% 
    mutate(percent = freq*100/sum(by_event2$freq), EVTYPE = as.factor(EVTYPE)) %>% 
    arrange(desc(freq))

count_events3 <- length(by_event2$EVTYPE)
count_lines2 <- sum(by_event2$freq)
percent_kept <- round((count_lines2/653507)*100,1)

events <- as.vector(by_event2$EVTYPE)

StormData <- StormData %>% 
    filter(EVTYPE %in% events)

print(count_events3)
## [1] 12
print(count_lines2)
## [1] 611006

That left us with a much easier to analyze data set with only 12 distinct event types, and still 611006 observations, 93.5% of the original number of lines in the data set.

Results

We can analyze the relevance of an extreme weather event from two perspectives: the sum of all harm done, which would make more frequent events the most relevant, or the mean harm per event, which can bring to our attention the most tragic events, even if they’re not that common. These two approaches seem relevant, so we’re going to analyze both of them.

summarized <- StormData %>%
    mutate(EVTYPE = as.factor(EVTYPE)) %>% 
    group_by(EVTYPE) %>% 
    summarize(freq = n(), total_dead = sum(FATALITIES), avg_dead = mean(FATALITIES), total_injured = sum(INJURIES), avg_injured = mean(INJURIES),
              total_prop_damage = sum(PROPDMG), avg_prop_damage = mean(PROPDMG), total_crop_damage = sum(CROPDMG), avg_crop_damage = mean(CROPDMG))

There are two ways of measuring human life consequences (fatalities and injuries) and two ways of measuring economic consequences (property and crop damage). Let’s see if any event type seems especially harmful both in the total and average measurements.

par(mfrow = c(2,2), font.main = 2, cex.main = 1.1, mar = c(4,4,4,1), oma = c(0,0,2,0))
with(summarized, plot(total_dead, total_injured, pch = 16))
abline(h = 2500, v = 500, lty = 2)
title(main = "TOTAL dead and injured")

with(summarized, plot(avg_dead, avg_injured, pch = 16))
abline(h = 0.2, v = 0.02, lty = 2)
title(main = "AVG dead and injured")

with(summarized, plot(total_prop_damage, total_crop_damage, pch = 16))
abline(h = 1000000000, v = 10000000000, lty = 2)
title(main = "TOTAL economic loss")

with(summarized, plot(avg_prop_damage, avg_crop_damage, pch = 16))
abline(h = 50000, v = 200000, lty = 2)
title(main = "AVG economic loss")

mtext("TOTAL AND AVERAGE HARM BY EVENT TYPE", font = 2, outer = TRUE, cex = 1.3)

Not surprisingly, we can observe a positive correlation between fatalities and injuries. Crop and property damage are also positively correlated. For every graph in the panel above there is one clear outlier, which will be our main concern, and we’re going to look at them shortly. We’ve added two dashed lines to each graph to determine separate the most harmful from the least harmful events. We’ll consider most dangerous the events to the right of each vertical line and to the top of each horizontal line. It’s not a fancy way to cluster those events, but it makes sense to worry separately about each of those dimensions. It’s time to find out which events were found to be the most harmful in each graph.

top5_health <- summarized %>% 
    filter(total_dead >= 500 | total_injured >= 2500) %>% 
    arrange(desc(total_injured))

top5_dmg <- summarized %>% 
    filter(total_prop_damage >= 10000000000 | total_crop_damage >= 1000000000) %>% 
    arrange(desc(total_prop_damage))

high_avg_health <- summarized %>% 
    filter(avg_dead >= 0.03 | avg_injured >= 0.2) %>% 
    arrange(desc(avg_injured))

high_avg_dmg <- summarized %>% 
    filter(avg_prop_damage >= 200000 | avg_crop_damage >= 50000) %>% 
    arrange(desc(avg_prop_damage))
as.data.frame(select(top5_health, EVTYPE, freq, total_dead, total_injured))
##              EVTYPE   freq total_dead total_injured
## 1           TORNADO  23153       1511         20667
## 2             FLOOD  24247        414          6758
## 3 THUNDERSTORM WIND 211091        376          5124
## 4         LIGHTNING  13202        650          4140
## 5       FLASH FLOOD  54499        917          1753
as.data.frame(select(high_avg_health, EVTYPE, freq, avg_dead, avg_injured))
##      EVTYPE  freq   avg_dead avg_injured
## 1   TORNADO 23153 0.06526152   0.8926273
## 2 LIGHTNING 13202 0.04923496   0.3135889
## 3     FLOOD 24247 0.01707428   0.2787149
as.data.frame(select(top5_dmg, EVTYPE, freq, total_prop_damage, total_crop_damage))
##              EVTYPE   freq total_prop_damage total_crop_damage
## 1             FLOOD  24247       29059833550        4974778400
## 2           TORNADO  23153       24616905710         283425010
## 3       FLASH FLOOD  54499       15300795560        1345264800
## 4              HAIL 207714       14595143420        2476029450
## 5 THUNDERSTORM WIND 211091        7905001380        1016942600
as.data.frame(select(high_avg_dmg, EVTYPE, freq, avg_prop_damage, avg_crop_damage))
##        EVTYPE  freq avg_prop_damage avg_crop_damage
## 1       FLOOD 24247      1198491.92       205170.88
## 2     TORNADO 23153      1063227.47        12241.39
## 3 FLASH FLOOD 54499       280753.69        24684.21
## 4   HIGH WIND 20229       259560.03        31334.29
## 5  HEAVY RAIN 11509        50818.01        63269.60

So we’ve found the outlier events were TORNADO for the human live incidents, and FLOOD for economic loss. Those two events were found to be the most harmful both in the total and the average measurements. Both events also appear as one of the most relevant in all four approaches, which is enough to say they are definitely the most dangerous ones and should get most attention.

Frequency by State

Now that we know which events are the most harmful, lets have a look on how often does each state in the US is hit by such events. We’re first going to group the information and finish up with some nice map plots.

fl_by_state <- StormData %>% 
    filter(EVTYPE == "FLOOD") %>% 
    group_by(STATE) %>% 
    summarize(freq = n()) %>% 
    mutate(fips = fips(STATE)) %>% 
    select(fips, freq) %>% 
    filter(!is.na(fips))

tn_by_state <- StormData %>% 
    filter(EVTYPE == "TORNADO") %>% 
    group_by(STATE) %>% 
    summarize(freq = n()) %>% 
    mutate(fips = fips(STATE)) %>% 
    select(fips, freq) %>% 
    filter(!is.na(fips))
plot_usmap(regions = "states", data = as.data.frame(tn_by_state), values = "freq") + ggtitle("Frequency of tornadoes per state")

plot_usmap(regions = "states", data = as.data.frame(fl_by_state), values = "freq") + ggtitle("Frequency of floods per state")