Synopsis

Weather-related events can cause significant losses in lives, health, and economic terms. Knowing which events cause the most damage can help guide the allocation of scarce resources to where they can be of most help. Here, I use data collected by the National Oceanic & Atmospheric Administration, which tracks fatalities, injuries, and monetary damages on a per-event basis, to discover which weather events are the major offenders.

My analysis finds that tornadoes cause by far the most injuries, generate the second-highest number of fatalities among weather events, and trail only floods in recorded property and crop damage. Heat waves cause the most fatalities and are a distant second in injuries. Floods cause the most damage to property and crops.

With this knowledge, policymakers must choose which events to manage most heavily based on potential damages, ability to deal with large vs. frequent issues, and what resources are available.

Data processing

Loading all the necessary libraries for the analysis:

library(data.table)
library(plyr)
library(lubridate)
library(ggplot2)

The NOAA data for the years 1950 to 2011 can be downloaded from the URL in the code below. We do that in a cached operation, reading it into the ‘stormData’ variable, which holds 902,297 observations of 37 variables.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "StormData.csv.bz2", quiet=TRUE, method="curl")
stormData <- data.table(read.csv("StormData.csv.bz2"))
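
The caching mentioned above presumably comes from the report's rendering setup, which is not shown here. As a rough sketch of a similar effect in plain R, the download can be skipped whenever the file is already on disk. The helper name `download_once` and its `fetch` parameter are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: fetch `url` into `dest` only when `dest` is missing.
# `fetch` defaults to download.file; it is a parameter so the logic can be
# exercised without network access.
download_once <- function(url, dest, fetch = download.file, ...) {
  if (!file.exists(dest)) {
    fetch(url, dest, ...)
    return(TRUE)   # a download was attempted
  }
  FALSE            # existing file reused
}

# Usage with the URL from the report (not run here):
# download_once(url, "StormData.csv.bz2", quiet = TRUE, method = "curl")
```

The function returns TRUE when it had to fetch the file and FALSE when it reused an existing copy, which makes it easy to log which case occurred.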

Then we run some cleanup operations.

  1. Convert BGN_DATE to a date class and use it to keep only events from the last 30 years of the data (1982 to 2011). Records have become more complete over time, and some event types were recorded more thoroughly than others in earlier years, which could bias the results.
  2. Convert all event types to upper case, since we have instances of e.g. “Wintry mix” and “WINTRY MIX”.
  3. Convert similar event types (such as “RECORD HEAT” and “EXTREME HEAT”) to be the same.

stormClean <- stormData
stormClean$BGN_DATE <- parse_date_time(stormClean$BGN_DATE, orders="%m/%d/%Y %H:%M:%S")
stormClean <- stormClean[year(stormClean$BGN_DATE) > 1981,]

stormClean$EVTYPE <- toupper(stormClean$EVTYPE)

stormClean[grepl("^HURRI",stormClean$EVTYPE),]$EVTYPE <- "HURRICANE"
stormClean[grepl("TORN",stormClean$EVTYPE),]$EVTYPE <- "TORNADO"
stormClean[grepl("HEAT",stormClean$EVTYPE),]$EVTYPE <- "HEAT"
stormClean[grepl("WARM",stormClean$EVTYPE),]$EVTYPE <- "HEAT"
stormClean[grepl("TSTM[ ]?W",stormClean$EVTYPE),]$EVTYPE <- "TSTM WIND"
stormClean[grepl("FLOOD",stormClean$EVTYPE),]$EVTYPE <- "FLOOD"

## Template for checking types:
## unique(stormClean[grepl("HURR",stormClean$EVTYPE),EVTYPE])
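
The effect of the upper-casing and pattern replacements can be checked on a handful of labels before running them on the full table. The sketch below uses base R so that it runs without extra packages; the helper `normalize_evtype` and the sample labels are made up for illustration, while the patterns are copied from the cleanup code above.

```r
# Ordered pattern -> label rules, copied from the cleanup step above.
rules <- c("^HURRI" = "HURRICANE", "TORN" = "TORNADO", "HEAT" = "HEAT",
           "WARM" = "HEAT", "TSTM[ ]?W" = "TSTM WIND", "FLOOD" = "FLOOD")

# Hypothetical helper: upper-case the labels, then apply each rule in order.
normalize_evtype <- function(x, rules) {
  x <- toupper(as.character(x))
  for (pattern in names(rules)) {
    x[grepl(pattern, x)] <- rules[[pattern]]
  }
  x
}

# Made-up sample labels, for illustration only.
sample_types <- c("Wintry mix", "RECORD HEAT", "Flash Flood", "TSTM WINDS",
                  "Hurricane Erin", "UNSEASONABLY WARM")
normalize_evtype(sample_types, rules)
```

Because the rules are applied in order, a label that matches several patterns ends up with the label of the last matching rule, so the order of the rules matters.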

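Taking this data, we extract a few summary statistics for each event type. First, we find the total, mean, and median fatalities, injuries, and monetary damage. As a broad first look, we combine the property and crop damage values; note that PROPDMG and CROPDMG are summed as recorded, without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP, so the damage figures are not dollar amounts.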

byEventFat <- stormClean[, list(numEvents = .N,
                               totalFat = sum(FATALITIES),
                               meanFat = mean(FATALITIES),
                               medianFat = median(FATALITIES)), by = EVTYPE]
byEventInj <- stormClean[, list(numEvents = .N,
                               totalInj = sum(INJURIES),
                               meanInj = mean(INJURIES),
                               medianInj = median(INJURIES)), by = EVTYPE]
byEventDmg <- stormClean[, list(numEvents = .N,
                               totalDmg = sum(PROPDMG + CROPDMG),
                               meanDmg = mean(PROPDMG + CROPDMG),
                               medianDmg = median(PROPDMG + CROPDMG)), by = EVTYPE]
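
The damage summary adds PROPDMG and CROPDMG as recorded. The data set also carries magnitude codes in PROPDMGEXP and CROPDMGEXP, which the code above does not use. The following sketch shows one way the codes might be applied to get dollar values. The mapping (K, M, and B as thousands, millions, and billions, and every other code as 1) is an assumption made here, and the helper `exp_multiplier` and the sample rows are hypothetical.

```r
# Hypothetical helper. Assumed mapping: K = 1e3, M = 1e6, B = 1e9;
# any other code (blank, digits, symbols) is treated as a multiplier of 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up rows for illustration.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)
toy$propDollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

Applying such a conversion before summing would likely change the damage totals and possibly the ranking of event types, so the figures reported here should be read with that in mind.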

Next, we find the number of events with fatalities, injuries, and property or crop damage, and merge that information into our prior tables.

fatRows <- stormClean[FATALITIES > 0, list(withFat = .N), by = EVTYPE]
injRows <- stormClean[INJURIES > 0, list(withInj = .N), by = EVTYPE]
dmgRows <- stormClean[PROPDMG > 0 | CROPDMG > 0, list(withDmg = .N), by = EVTYPE]

byEventFat <- merge(byEventFat, fatRows, by="EVTYPE", all=TRUE)
byEventInj <- merge(byEventInj, injRows, by="EVTYPE", all=TRUE)
byEventDmg <- merge(byEventDmg, dmgRows, by="EVTYPE", all=TRUE)
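
Because the merges use all=TRUE, an event type with no fatal (or injurious, or damaging) events has no matching row in the count table and receives NA rather than 0. A minimal base R sketch with made-up numbers shows this and one possible way to fill the gaps:

```r
# Made-up summary tables for illustration.
totals <- data.frame(EVTYPE = c("FLOOD", "HAIL", "HEAT"),
                     numEvents = c(10, 20, 5))
withFat <- data.frame(EVTYPE = c("FLOOD", "HEAT"), withFat = c(3, 4))

merged <- merge(totals, withFat, by = "EVTYPE", all = TRUE)
# HAIL has no row in `withFat`, so its count is NA after the merge.
merged$withFat[is.na(merged$withFat)] <- 0
merged$shareWithFat <- merged$withFat / merged$numEvents
merged
```

Whether to replace NA with 0 is a judgment call. Here a missing count really does mean that no such events were recorded, so 0 seems a reasonable choice.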

Results

In order to eliminate outliers caused by miscategorization, we limit our analysis to event types with more than five instances.

First, although hurricanes cause significant loss of life, the geographic spread of heat waves and the frequency of tornadoes make them the largest weather-related killers.

fatTop <- head(arrange(byEventFat[numEvents > 5,], totalFat, decreasing=TRUE), 10)
fatTop$EVTYPE <- with(fatTop, factor(EVTYPE, EVTYPE))
ggplot(fatTop, aes(x=EVTYPE, y=totalFat)) + geom_bar(stat="identity")
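
The factor(EVTYPE, EVTYPE) step is what keeps the bars in descending order: ggplot2 places discrete values on the axis in factor level order, and the default levels are alphabetical. A small base R sketch with made-up totals shows the difference between the two sets of levels:

```r
# Made-up totals, already sorted in decreasing order.
top <- data.frame(EVTYPE = c("TORNADO", "HEAT", "FLOOD"),
                  total = c(30, 20, 10),
                  stringsAsFactors = FALSE)

default_levels <- levels(factor(top$EVTYPE))              # alphabetical
ordered_levels <- levels(factor(top$EVTYPE, top$EVTYPE))  # row order kept

default_levels
ordered_levels
```

Passing the sorted column as its own levels works only because each event type appears once in the table.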

As for injuries, tornadoes are again a principal cause. In fact, they cause far more injuries than any other type of weather event.

injTop <- head(arrange(byEventInj[numEvents > 5,], totalInj, decreasing=TRUE), 10)
injTop$EVTYPE <- with(injTop, factor(EVTYPE, EVTYPE))
ggplot(injTop, aes(x=EVTYPE, y=totalInj)) + geom_bar(stat="identity")

Similarly, the frequency of flooding and tornadoes means that they hold the lead in recorded damage to property and crops.

dmgTop <- head(arrange(byEventDmg[numEvents > 5,], totalDmg, decreasing=TRUE), 10)
dmgTop$EVTYPE <- with(dmgTop, factor(EVTYPE, EVTYPE))
ggplot(dmgTop, aes(x=EVTYPE, y=totalDmg)) + geom_bar(stat="identity")

Compare these results to hurricane events, which are infrequent but cause large numbers of fatalities, injuries, and monetary damages per event:

compEvents <- c("HURRICANE", "TORNADO", "FLOOD", "HEAT")
byEventFat[EVTYPE %in% compEvents,]
##       EVTYPE numEvents totalFat    meanFat medianFat withFat
## 1: HURRICANE       288      135 0.46875000         0      48
## 2:   TORNADO     37020     2250 0.06077796         0     745
## 3:     FLOOD     82731     1525 0.01843324         0     986
## 4:      HEAT      2975     3178 1.06823529         0     798
byEventInj[EVTYPE %in% compEvents,]
##       EVTYPE numEvents totalInj   meanInj medianInj withInj
## 1: HURRICANE       288     1328 4.6111111         0      30
## 2:   TORNADO     37020    36077 0.9745273         0    3474
## 3:     FLOOD     82731     8604 0.1039997         0     558
## 4:      HEAT      2975     9243 3.1068908         0     233
byEventDmg[EVTYPE %in% compEvents,]
##       EVTYPE numEvents   totalDmg    meanDmg medianDmg withDmg
## 1: HURRICANE       288   34570.04 120.034861    17.565     213
## 2:   TORNADO     37020 2125949.79  57.427061     2.500   21532
## 3:     FLOOD     82731 2800638.24  33.852344     0.000   32037
## 4:      HEAT      2975    4716.04   1.585224     0.000      66
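
To make the contrast between frequent and severe events more concrete, the fatality figures printed above can be turned into per-event rates. This is only a sketch: the numbers are copied by hand from the table, and the column name shareFatal is introduced here.

```r
# Values copied from the fatality table above.
fat <- data.frame(EVTYPE = c("HURRICANE", "TORNADO", "FLOOD", "HEAT"),
                  numEvents = c(288, 37020, 82731, 2975),
                  totalFat = c(135, 2250, 1525, 3178),
                  withFat = c(48, 745, 986, 798),
                  stringsAsFactors = FALSE)

fat$meanFat <- fat$totalFat / fat$numEvents    # deaths per recorded event
fat$shareFatal <- fat$withFat / fat$numEvents  # fraction of events with a death

fat[order(fat$meanFat, decreasing = TRUE), c("EVTYPE", "meanFat", "shareFatal")]
```

Ranked this way, heat and hurricane events lead on a per-event basis, while tornadoes and floods reach their high totals mainly through the sheer number of events.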

Appendix - Software environment

sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.1    lubridate_1.3.3  plyr_1.8.1       data.table_1.9.4
## 
## loaded via a namespace (and not attached):
##  [1] chron_2.3-45     colorspace_1.2-6 digest_0.6.8     evaluate_0.5.5  
##  [5] formatR_1.0      grid_3.1.3       gtable_0.1.2     htmltools_0.2.6 
##  [9] knitr_1.9        labeling_0.3     MASS_7.3-39      memoise_0.2.1   
## [13] munsell_0.4.2    proto_0.3-10     Rcpp_0.11.5      reshape2_1.4.1  
## [17] rmarkdown_0.5.1  scales_0.2.4     stringr_0.6.2    tools_3.1.3     
## [21] yaml_2.1.13