This document explores the answer to two key issues facing urban emergency managers: what types of events cause the greatest damage to human health and which types of events cause the greatest economic damage. To answer this question, we leverage the last 25 years of data in the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We focus on the sum of fatalities and injuries as being the key measure of human health, and both crop and property damage as driving economic consequences. Our initial findings suggest that tornados (and tornado-related storms) have been the most significant cause of death. Economic consequences are, by far, most significant for hurricanes and very concentrated in a few in that category type. Further study is necessary to determine the best ways for urban emergency planners take into account their specific geographic and risk profile to ensure that investments in safety and storm management are most effective.
In this study, we were trying to answer two basic questions:
The analysis that was performed used the R language and all of the code that was used to explore the data is provided in this document to support the reproducibility of the findings. This document is best understood if you have a solid understanding of the data.table and ggplot packages and at least a passing familiarity of GREP. First, the environment and necessary libraries should be loaded:
knitr::opts_chunk$set(echo=TRUE)
knitr::opts_chunk$set(cache=FALSE)
# code to add captions
knitr::knit_hooks$set(htmlcap = function(before, options, envir) {
if(!before) {
paste0('<p class="caption">',options$htmlcap,"</p>")
}
})
library(data.table)
library(lubridate)
library(ggplot2)
library(scales)
library(gridExtra)
library(downloader)
The full data processing chain is shown from initial download to final output. To explore the health impacts, I have chosen to sum injuries and fatalities with equal weighting. This may not be the ideal way to evaluate human health impact, but further study will be required and this remains outside the scope of this initial study. We also have to create valid numbers upon which to compare the economic damages. Economic damage will be considered to be the sum of property and crop damage. This was done after examining National Weather Service Storm Data Documentation found here. Now, for the code which demonstrates the download and initial processing:
# Get the data; be explicit about where the data is going locally
setwd('/Volumes/MacStorage/Coursera Data/storm/')
remote_filename <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
downloaded_filename <- "repdata-data-StormData.csv.bz2"
download(remote_filename, downloaded_filename)
# Load it in
data <- data.table(read.csv(downloaded_filename))
# Limit to the last 25 years worth of information (based on the date of the start of a storm event) and
# the columns of data that we will use in this analysis
# Show Sys.Date() for reproducibility
Sys.Date()
## [1] "2016-03-06"
# Shows that we are only losing 18% of the data
nrow(data[as.Date(as.character(BGN_DATE), format = '%m/%d/%Y') > Sys.Date() - years(25)])/nrow(data)
## [1] 0.820487
data <- data[as.Date(as.character(BGN_DATE), format = '%m/%d/%Y') > Sys.Date() - years(25), .(EVTYPE, FATALITIES, INJURIES, PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG)]
# Look at the information contained in the database
str(data)
## Classes 'data.table' and 'data.frame': 740323 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 856 244 856 244 856 856 244 244 856 244 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
# To explore human health effects, going to add together fatalities and injuries
data[, humanHealth := FATALITIES + INJURIES]
## Economic consequences: need to create columns that are meaningful (apply the order of magnitude column)
vec.validExponents <- c("K", "M", "B")
applyExponent <- function(x) { if(x == "K") { 1e3 } else if(x == "M") { 1e6 } else if(x == "B") { 1e9 } else { 0 } }
data[, propertyDamage := as.numeric(0)]
data[PROPDMGEXP %in% vec.validExponents, propertyDamage := as.numeric(PROPDMG) * ifelse(PROPDMGEXP == "B", 1e9, ifelse(PROPDMGEXP == "M", 1e6, 1e3))]
data[, cropDamage := as.numeric(0)]
data[CROPDMGEXP %in% vec.validExponents, cropDamage := as.numeric(CROPDMG) * ifelse(PROPDMGEXP == "B", 1e9, ifelse(PROPDMGEXP == "M", 1e6, 1e3))]
data[, damage := propertyDamage + cropDamage]
# Shows that the ratio of untreated zeros to treated zeros is very very close to 1
# (i.e., our transformations did not adversly impact the data set)
nrow(data[PROPDMG == 0 & CROPDMG == 0])/nrow(data[damage == 0])
## [1] 0.9993786
Some initial exploration reveals that the event types (coded in the EVTYPE column) are very similar (e.g., THUNDERSTORM WIND and THUNDERSTORM WINDS)
# Top 20 health impact
data[, .(healthImpact = sum(humanHealth)), by = EVTYPE][order(-healthImpact)][1:20]
## EVTYPE healthImpact
## 1: TORNADO 27140
## 2: EXCESSIVE HEAT 8428
## 3: FLOOD 7259
## 4: LIGHTNING 6046
## 5: TSTM WIND 4698
## 6: HEAT 3037
## 7: FLASH FLOOD 2755
## 8: ICE STORM 2064
## 9: THUNDERSTORM WIND 1621
## 10: WINTER STORM 1527
## 11: HIGH WIND 1385
## 12: HURRICANE/TYPHOON 1339
## 13: HEAVY SNOW 1148
## 14: HAIL 1086
## 15: WILDFIRE 986
## 16: THUNDERSTORM WINDS 972
## 17: BLIZZARD 906
## 18: FOG 796
## 19: RIP CURRENT 600
## 20: WILD/FOREST FIRE 557
# Ratio of Top 20 impacts to total health impacts
origPercentCovered <- round(sum(data[, .(healthImpact = sum(humanHealth)), by = EVTYPE][order(-healthImpact)][1:20]$healthImpact)/sum(data$humanHealth)*100,1)
Here is another example of inconsistent coding:
# Inconsistent coding example
unique(data[grepl("RIP CURRENT", EVTYPE)]$EVTYPE)
## [1] RIP CURRENT RIP CURRENTS HEAVY SURF RIP CURRENTS/HEAVY SURF
## [4] RIP CURRENTS
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
So, after some initial exploration, the following code was run on the data set to help consolidate the EVTYPEs into something more meaningful:
# create two new columns: EventTypeClean to represent the roll-up of related event types and
# rollupComplete to signify that a column has already been rolled up
data[, `:=` (EventTypeClean = as.character(EVTYPE),
rollupComplete = 0)]
# data table to hold the consolidating string and the summary string name
dt.grep <- data.table(grepString = c("TORN", "(THUNDERSTORM|THUNDER|TSTM|THUNDEER) WIND","RIP CURRENT", "FLASH", "COLD","SURF","HEAT","WIND","WINTER","HURRICANE","RIVER FLOOD"),
eventClean = c("TORNADO","THUNDERSTORM WIND" ,"RIP CURRENT", "FLASH FLOOD","COLD","SURF","HEAT","WIND","WINTER","HURRICANE","RIVER FLOOD"))
# Loop through each element in dt.grep and make into a summarized event type
for(i in 1:nrow(dt.grep)) {
data[rollupComplete == 0 & grepl(dt.grep[i]$grepString, EventTypeClean, ignore.case = TRUE),
`:=` (EventTypeClean = dt.grep[i]$eventClean,
rollupComplete = 1)]
}
From here, we can see that there are many more events rolled together. Specifically:
data[, .(healthImpact = sum(humanHealth)), by = EventTypeClean][order(-healthImpact)][1:20]
## EventTypeClean healthImpact
## 1: TORNADO 27229
## 2: HEAT 12362
## 3: THUNDERSTORM WIND 7458
## 4: FLOOD 7259
## 5: LIGHTNING 6046
## 6: FLASH FLOOD 2837
## 7: WIND 2439
## 8: WINTER 2153
## 9: ICE STORM 2064
## 10: HURRICANE 1461
## 11: HEAVY SNOW 1148
## 12: RIP CURRENT 1106
## 13: HAIL 1086
## 14: WILDFIRE 986
## 15: BLIZZARD 906
## 16: FOG 796
## 17: COLD 771
## 18: WILD/FOREST FIRE 557
## 19: DUST STORM 462
## 20: SURF 407
percent_covered <- paste0(round(sum(data[, .(healthImpact = sum(humanHealth)), by = EventTypeClean][order(-healthImpact)][1:20]$healthImpact)/sum(data$humanHealth) * 100,0),"%")
So our top 20 events cover 96% of the events; much more than the 89.8% from before.
The results will focus on the two questions as noted above and covered in their two independent sections below.
To best illustrate the impact, let’s look at a waterfall chart of Health Impact focusing on the top twenty event types:
top20 <- data[, .(healthImpact=sum(humanHealth)), by=EventTypeClean][order(-healthImpact)][1:20]$EventTypeClean
dt.plot <- rbind(data[EventTypeClean %in% top20, .(healthImpact=sum(humanHealth), barType = 'Event'), by = EventTypeClean][order(-healthImpact)],
data[!(EventTypeClean %in% top20), .(EventTypeClean = 'Other', healthImpact=sum(humanHealth), barType = 'Other')],
data[, .(EventTypeClean = 'Total', healthImpact=sum(humanHealth), barType = 'Total')])
dt.plot[, EventTypeClean := factor(EventTypeClean, levels = EventTypeClean)]
# set up start and end of waterfall bars
dt.plot[, counter := 1]
dt.plot[, id := cumsum(counter)]
dt.plot[, end := cumsum(healthImpact)]
dt.plot[, start := c(0, head(dt.plot$end, -1))]
dt.plot[EventTypeClean == 'Total', `:=` (end = healthImpact, start = 0)]
# generate the plot
ggplot(data = dt.plot) +
geom_hline(yintercept = 0) +
geom_rect(mapping = aes(x = EventTypeClean, xmin = id - 0.45, xmax = id + 0.45,
ymin = start/1e3, ymax = end/1e3, fill = barType)) +
theme(legend.position = c(0.5, 0.5)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_fill_discrete(name = "Event Category") +
scale_x_discrete(name = "Event Type") +
scale_y_continuous(name = "Impact to Health\n(Thousands; Fatalities + Injuries)") +
ggtitle("Health Impact of Storms in U.S. of Last 25 Years")
Caption: A waterfall chart of the human health, aggregated by event type. Tornados are key.
Clearly, tornados, have the greatest impact on human health and their impact should be considered carefully for urban planners. Distant second and third place event types are Heat and Thunderstorms/Wind.
To expore this, let’s consider the cumulative distribution function of the underlying economic impact data, rolled up by event type:
ggplot(data=data[damage > 0, .(damage = sum(damage)), by=EventTypeClean], mapping=aes(x=damage/1e9)) +
stat_ecdf(geom="step") +
scale_x_continuous(name = "Storm Economic Damage ($B)\nRank order of Event Type by Damage") +
scale_y_continuous(name = "CDF", labels = percent) +
ggtitle("Cumulative Distribution Function of Damage by Event Type")
Caption: The CDF shows both that many event types have zero economic consequence and a few have large impacts.
This is very difficult to glean much from the graph because the distribution is so long-tailed. The one aspect that we can clearly see is that most event types have zero or near zero economic consequence. It may be easier to see the events with significant consequence if we just look at a table of results of the top 20 event types, sorted in descending order by total economic damage. Here are our results:
format_big <- function(x) { return( format(round(x/1e9, 1), big.mark = ',', big.interval = 3L) ) }
tableToGrob <- copy(data)
tableToGrob <- tableToGrob[, .(propertyDamage = sum(propertyDamage), cropDamage=sum(cropDamage), damage=sum(damage)), by=EventTypeClean][order(-damage)][1:20]
tableToGrob[, `:=` (propertyDamage = format_big(propertyDamage),
cropDamage = format_big(cropDamage),
damage = format_big(damage))]
setnames(tableToGrob, c("EventTypeClean", "propertyDamage", "cropDamage", "damage"),
c('Event Type', 'Property Damage\n($B)', 'Crop Damage\n($B)', 'Total Damage\n($B)'))
grid.arrange(tableGrob(tableToGrob, theme = ttheme_minimal(), rows = NULL))
# fraction caused by top 20 events:
top20economic <- round(data[order(-damage)][1:20][, sum(damage)]/sum(data$damage)*100,0)
Caption: Table with Top 20 Economic Damage Event Types
This shows that hurricanes and floods should receive the vast majority of focus for urban safety planning. We should be somewhat cautious of any of our conclusions here, given that the top 20 events account for 84% of all economic consequences over the last 25 years.
Tornados, heat and Thunderstorms/Wind are the three most dangerous storm event types (i.e., highest impact to human health) in the United States over the last 25 years. Hurricanes, floods and tornados, however, have had the most significant economic impacts. Planning for urban centers should determine how these types of storm can effect the specific city environment and look for ways to protect people and property should a city be unfortunate enough to experience a devastating storm. Further study is required to help urban planners better understand the specific risks for their geography and what the optimal understanding of human health is with regards to the balance between injuries and fatalities.