Synopsis

This document explores the answer to two key issues facing urban emergency managers: what types of events cause the greatest damage to human health and which types of events cause the greatest economic damage. To answer this question, we leverage the last 25 years of data in the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We focus on the sum of fatalities and injuries as being the key measure of human health, and both crop and property damage as driving economic consequences. Our initial findings suggest that tornados (and tornado-related storms) have been the most significant cause of death. Economic consequences are, by far, most significant for hurricanes and very concentrated in a few in that category type. Further study is necessary to determine the best ways for urban emergency planners take into account their specific geographic and risk profile to ensure that investments in safety and storm management are most effective.

Introduction

In this study, we were trying to answer two basic questions:

  1. Across the United States, which types of events are most harmful with respect to population health? and
  2. Across the United States, which types of events have the greatest economic consequences?

The analysis that was performed used the R language and all of the code that was used to explore the data is provided in this document to support the reproducibility of the findings. This document is best understood if you have a solid understanding of the data.table and ggplot packages and at least a passing familiarity of GREP. First, the environment and necessary libraries should be loaded:

knitr::opts_chunk$set(echo=TRUE)
knitr::opts_chunk$set(cache=FALSE)
# code to add captions
knitr::knit_hooks$set(htmlcap = function(before, options, envir) {
  if(!before) {
    paste0('<p class="caption">',options$htmlcap,"</p>")
  }
})

library(data.table)
library(lubridate)
library(ggplot2)
library(scales)
library(gridExtra)
library(downloader)

Data Processing

The full data processing chain is shown from initial download to final output. To explore the health impacts, I have chosen to sum injuries and fatalities with equal weighting. This may not be the ideal way to evaluate human health impact, but further study will be required and this remains outside the scope of this initial study. We also have to create valid numbers upon which to compare the economic damages. Economic damage will be considered to be the sum of property and crop damage. This was done after examining National Weather Service Storm Data Documentation found here. Now, for the code which demonstrates the download and initial processing:

# Get the data; be explicit about where the data is going locally
setwd('/Volumes/MacStorage/Coursera Data/storm/')
remote_filename <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
downloaded_filename <- "repdata-data-StormData.csv.bz2"
download(remote_filename, downloaded_filename)
# Load it in
data <- data.table(read.csv(downloaded_filename))
# Limit to the last 25 years worth of information (based on the date of the start of a storm event) and
#   the columns of data that we will use in this analysis
# Show Sys.Date() for reproducibility
Sys.Date()
## [1] "2016-03-06"
# Shows that we are only losing 18% of the data
nrow(data[as.Date(as.character(BGN_DATE), format = '%m/%d/%Y') > Sys.Date() - years(25)])/nrow(data)
## [1] 0.820487
data <- data[as.Date(as.character(BGN_DATE), format = '%m/%d/%Y') > Sys.Date() - years(25), .(EVTYPE, FATALITIES, INJURIES, PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG)]
# Look at the information contained in the database
str(data)
## Classes 'data.table' and 'data.frame':   740323 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 856 244 856 244 856 856 244 244 856 244 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# To explore human health effects, going to add together fatalities and injuries
data[, humanHealth := FATALITIES + INJURIES]

## Economic consequences: need to create columns that are meaningful (apply the order of magnitude column)
vec.validExponents <- c("K", "M", "B")
applyExponent <- function(x) { if(x == "K") { 1e3 } else if(x == "M") { 1e6 } else if(x == "B") { 1e9 } else { 0 } }
data[, propertyDamage := as.numeric(0)]
data[PROPDMGEXP %in% vec.validExponents, propertyDamage := as.numeric(PROPDMG) * ifelse(PROPDMGEXP == "B", 1e9, ifelse(PROPDMGEXP == "M", 1e6, 1e3))]
data[, cropDamage := as.numeric(0)]
data[CROPDMGEXP %in% vec.validExponents, cropDamage := as.numeric(CROPDMG) * ifelse(PROPDMGEXP == "B", 1e9, ifelse(PROPDMGEXP == "M", 1e6, 1e3))]
data[, damage := propertyDamage + cropDamage]
# Shows that the ratio of untreated zeros to treated zeros is very very close to 1 
# (i.e., our transformations did not adversly impact the data set)
nrow(data[PROPDMG == 0 & CROPDMG == 0])/nrow(data[damage == 0])
## [1] 0.9993786

Some initial exploration reveals that the event types (coded in the EVTYPE column) are very similar (e.g., THUNDERSTORM WIND and THUNDERSTORM WINDS)

# Top 20 health impact
data[, .(healthImpact = sum(humanHealth)), by = EVTYPE][order(-healthImpact)][1:20]
##                 EVTYPE healthImpact
##  1:            TORNADO        27140
##  2:     EXCESSIVE HEAT         8428
##  3:              FLOOD         7259
##  4:          LIGHTNING         6046
##  5:          TSTM WIND         4698
##  6:               HEAT         3037
##  7:        FLASH FLOOD         2755
##  8:          ICE STORM         2064
##  9:  THUNDERSTORM WIND         1621
## 10:       WINTER STORM         1527
## 11:          HIGH WIND         1385
## 12:  HURRICANE/TYPHOON         1339
## 13:         HEAVY SNOW         1148
## 14:               HAIL         1086
## 15:           WILDFIRE          986
## 16: THUNDERSTORM WINDS          972
## 17:           BLIZZARD          906
## 18:                FOG          796
## 19:        RIP CURRENT          600
## 20:   WILD/FOREST FIRE          557
# Ratio of Top 20 impacts to total health impacts
origPercentCovered <- round(sum(data[, .(healthImpact = sum(humanHealth)), by = EVTYPE][order(-healthImpact)][1:20]$healthImpact)/sum(data$humanHealth)*100,1)

Here is another example of inconsistent coding:

# Inconsistent coding example
unique(data[grepl("RIP CURRENT", EVTYPE)]$EVTYPE)
## [1] RIP CURRENT             RIP CURRENTS HEAVY SURF RIP CURRENTS/HEAVY SURF
## [4] RIP CURRENTS           
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

So, after some initial exploration, the following code was run on the data set to help consolidate the EVTYPEs into something more meaningful:

# create two new columns: EventTypeClean to represent the roll-up of related event types and
# rollupComplete to signify that a column has already been rolled up
data[, `:=` (EventTypeClean = as.character(EVTYPE),
             rollupComplete = 0)]
# data table to hold the consolidating string and the summary string name
dt.grep <- data.table(grepString = c("TORN",   "(THUNDERSTORM|THUNDER|TSTM|THUNDEER) WIND","RIP CURRENT", "FLASH",      "COLD","SURF","HEAT","WIND","WINTER","HURRICANE","RIVER FLOOD"),
                      eventClean = c("TORNADO","THUNDERSTORM WIND"                        ,"RIP CURRENT", "FLASH FLOOD","COLD","SURF","HEAT","WIND","WINTER","HURRICANE","RIVER FLOOD"))
# Loop through each element in dt.grep and make into a summarized event type
for(i in 1:nrow(dt.grep)) {
    data[rollupComplete == 0 & grepl(dt.grep[i]$grepString, EventTypeClean, ignore.case = TRUE), 
         `:=` (EventTypeClean = dt.grep[i]$eventClean,
               rollupComplete = 1)]
}

From here, we can see that there are many more events rolled together. Specifically:

data[, .(healthImpact = sum(humanHealth)), by = EventTypeClean][order(-healthImpact)][1:20]
##        EventTypeClean healthImpact
##  1:           TORNADO        27229
##  2:              HEAT        12362
##  3: THUNDERSTORM WIND         7458
##  4:             FLOOD         7259
##  5:         LIGHTNING         6046
##  6:       FLASH FLOOD         2837
##  7:              WIND         2439
##  8:            WINTER         2153
##  9:         ICE STORM         2064
## 10:         HURRICANE         1461
## 11:        HEAVY SNOW         1148
## 12:       RIP CURRENT         1106
## 13:              HAIL         1086
## 14:          WILDFIRE          986
## 15:          BLIZZARD          906
## 16:               FOG          796
## 17:              COLD          771
## 18:  WILD/FOREST FIRE          557
## 19:        DUST STORM          462
## 20:              SURF          407
percent_covered <- paste0(round(sum(data[, .(healthImpact = sum(humanHealth)), by = EventTypeClean][order(-healthImpact)][1:20]$healthImpact)/sum(data$humanHealth) * 100,0),"%")

So our top 20 events cover 96% of the events; much more than the 89.8% from before.

Results

The results will focus on the two questions as noted above and covered in their two independent sections below.

Harm to Population Health Due to Storms

To best illustrate the impact, let’s look at a waterfall chart of Health Impact focusing on the top twenty event types:

top20 <- data[, .(healthImpact=sum(humanHealth)), by=EventTypeClean][order(-healthImpact)][1:20]$EventTypeClean
dt.plot <- rbind(data[EventTypeClean %in% top20, .(healthImpact=sum(humanHealth), barType = 'Event'), by = EventTypeClean][order(-healthImpact)],
                 data[!(EventTypeClean %in% top20), .(EventTypeClean = 'Other', healthImpact=sum(humanHealth), barType = 'Other')],
                 data[, .(EventTypeClean = 'Total', healthImpact=sum(humanHealth), barType = 'Total')])
dt.plot[, EventTypeClean := factor(EventTypeClean, levels = EventTypeClean)]
# set up start and end of waterfall bars
dt.plot[, counter := 1]
dt.plot[, id := cumsum(counter)]
dt.plot[, end := cumsum(healthImpact)]
dt.plot[, start := c(0, head(dt.plot$end, -1))]
dt.plot[EventTypeClean == 'Total', `:=` (end = healthImpact, start = 0)]
# generate the plot
ggplot(data = dt.plot) + 
    geom_hline(yintercept = 0) + 
    geom_rect(mapping = aes(x = EventTypeClean, xmin = id - 0.45, xmax = id + 0.45, 
                            ymin = start/1e3, ymax = end/1e3, fill = barType)) +
    theme(legend.position = c(0.5, 0.5)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    scale_fill_discrete(name = "Event Category") +
    scale_x_discrete(name = "Event Type") +
    scale_y_continuous(name = "Impact to Health\n(Thousands; Fatalities + Injuries)") +
    ggtitle("Health Impact of Storms in U.S. of Last 25 Years")

Caption: A waterfall chart of the human health, aggregated by event type. Tornados are key.

Clearly, tornados, have the greatest impact on human health and their impact should be considered carefully for urban planners. Distant second and third place event types are Heat and Thunderstorms/Wind.

Economic Consequences of Storms

To expore this, let’s consider the cumulative distribution function of the underlying economic impact data, rolled up by event type:

ggplot(data=data[damage > 0, .(damage = sum(damage)), by=EventTypeClean], mapping=aes(x=damage/1e9)) + 
    stat_ecdf(geom="step") +
    scale_x_continuous(name = "Storm Economic Damage ($B)\nRank order of Event Type by Damage") +
    scale_y_continuous(name = "CDF", labels = percent) +
    ggtitle("Cumulative Distribution Function of Damage by Event Type")

Caption: The CDF shows both that many event types have zero economic consequence and a few have large impacts.

This is very difficult to glean much from the graph because the distribution is so long-tailed. The one aspect that we can clearly see is that most event types have zero or near zero economic consequence. It may be easier to see the events with significant consequence if we just look at a table of results of the top 20 event types, sorted in descending order by total economic damage. Here are our results:

format_big <- function(x) { return( format(round(x/1e9, 1), big.mark = ',', big.interval = 3L) ) }
tableToGrob <- copy(data)
tableToGrob <- tableToGrob[, .(propertyDamage = sum(propertyDamage), cropDamage=sum(cropDamage), damage=sum(damage)), by=EventTypeClean][order(-damage)][1:20]
tableToGrob[, `:=` (propertyDamage = format_big(propertyDamage),
                    cropDamage = format_big(cropDamage),
                    damage = format_big(damage))]
setnames(tableToGrob, c("EventTypeClean", "propertyDamage", "cropDamage", "damage"),
         c('Event Type', 'Property Damage\n($B)', 'Crop Damage\n($B)', 'Total Damage\n($B)'))
grid.arrange(tableGrob(tableToGrob, theme = ttheme_minimal(), rows = NULL))

# fraction caused by top 20 events:
top20economic <- round(data[order(-damage)][1:20][, sum(damage)]/sum(data$damage)*100,0)

Caption: Table with Top 20 Economic Damage Event Types

This shows that hurricanes and floods should receive the vast majority of focus for urban safety planning. We should be somewhat cautious of any of our conclusions here, given that the top 20 events account for 84% of all economic consequences over the last 25 years.

Conclusions

Tornados, heat and Thunderstorms/Wind are the three most dangerous storm event types (i.e., highest impact to human health) in the United States over the last 25 years. Hurricanes, floods and tornados, however, have had the most significant economic impacts. Planning for urban centers should determine how these types of storm can effect the specific city environment and look for ways to protect people and property should a city be unfortunate enough to experience a devastating storm. Further study is required to help urban planners better understand the specific risks for their geography and what the optimal understanding of human health is with regards to the balance between injuries and fatalities.