Synopsis

In this report, we examine severe weather events from the NOAA Storm Database and determine the top ten events that cause fatalities, injuries, crop damage, and property damage since the year 2000.

We show this in tabular format first, then in an aggregated visual format – stacked bar charts.

The data file required some cleaning. The most cumbersome cleaning task was dealing with the event type data which had a multitude of coding issues. We used a fuzzy string matching approach described in the report.

Data Processing

First we download the original Storm Data (.csv.bz) file and read it into a variable storm.
Documentation is available from National Weather Service Storm Data Documentation and the NCDC Storm Events FAQ

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
              destfile="stormdata.bz.csv", 
              method = "curl", mode = "wb")
storm <- read.csv("temp", stringsAsFactors = FALSE)
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. We will therefore retain only the data since 2000.

storm$BGN_DATE <- as.Date(storm$BGN_DATE, format = "%m/%d/%Y")
storm <- storm[storm$BGN_DATE >= "2000-01-01",]

In addition to EVTYPE, the relevant columns are:

rel_cols <- c("EVTYPE", "INJURIES", "FATALITIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm <- storm[,rel_cols]

The numeric fields INJURIES and FATALITIES need no cleaning.

EVTYPE has some duplicate categories (e.g., THUNDERSTORM WINDS vs Thunderstorm Wind), odd or miscoded fields (e.g., Summary September 3) and a variety of other issues. There are 48 actual event types identified in the Storm Data Documentation. After date restriction, there were 195 unique event types in the data set. I use a fuzzy string matching approach available stringdist package. I manually enter the values to real_events – it is less painful for you to read it than it was for me to type them in. Through trial and error, a maximum string distance of 8 seems to give the best results.
note: this is an imperfect match but a best attempt given the time constraints

library(stringdist)
library(dplyr)
real_events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
                 "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm",
                 "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog",
                 "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow",
                 "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow",
                 "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
                 "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide",
                 "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm",
                 "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
storm$EVTYPE <- toupper(storm$EVTYPE)
storm_events <- unique(storm$EVTYPE)

closest_distance <- function(string, stringVector) {
        stringVector[amatch(string, stringVector, maxDist=8)]
}
new_event <- {}
for(i in 1:length(storm_events)) {
        new_event <- append(new_event,
                            closest_distance(storm_events[i], toupper(real_events)))
}
lookup <- data.frame(EVTYPE = storm_events, new = new_event, stringsAsFactors = FALSE)
storm <- inner_join(storm, lookup)
storm <- storm[,-1]
colnames(storm)[7] <- "EVTYPE"
storm <- storm[complete.cases(storm),]

PROPDMGEXP and CROPDMGEXP have multipliers that effect PROPDMG, CROPDMG. Given the reduced data set, the transformations are as follows:

damage_mult <- function(old) {
        ifelse(old == "0", 0, 
               ifelse(old =="", 1, 
                      ifelse(old == "K", 1000, 
                             ifelse(old == "M", 1e+06, 
                                   ifelse(old == "B", 1e+09, NA)))))
}
storm$PROPDMG <- storm$PROPDMG * damage_mult(storm$PROPDMGEXP)
storm$CROPDMG <- storm$CROPDMG * damage_mult(storm$CROPDMGEXP)

Results

Across the United States, the following types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health (rank ordered). Results are shown first as tables for fatalaties and injuries. Tornados are the worst event in causing both death and injury. It is interesting to note that some events, like flash floods, have a higher ratio of deaths to injuries.

library(tidyr)
library(ggplot2)
library(knitr)
fatalities <- storm %>% group_by(EVTYPE) %>% summarize(fatalities = sum(FATALITIES)) %>%
        arrange(desc(fatalities))
injuries <- storm %>% group_by(EVTYPE) %>% summarize(injured = sum(INJURIES)) %>%
        arrange(desc(injured))
kable(fatalities %>% top_n(10), caption = "Top 10 weather events causing fatalities since 2000")
Top 10 weather events causing fatalities since 2000
EVTYPE fatalities
TORNADO 1193
EXCESSIVE HEAT 1013
FLASH FLOOD 600
LIGHTNING 467
RIP CURRENT 462
FLOOD 323
HIGH WIND 252
HEAT 231
AVALANCHE 179
EXTREME COLD/WIND CHILL 132
THUNDERSTORM WIND 132
kable(injuries  %>% top_n(10), caption = "Top 10 weather events causing injuries since 2000")
Top 10 weather events causing injuries since 2000
EVTYPE injured
TORNADO 15213
EXCESSIVE HEAT 3708
LIGHTNING 2995
HIGH WIND 2433
THUNDERSTORM WIND 1400
HURRICANE (TYPHOON) 1275
WILDFIRE 1251
HEAT 1226
FLASH FLOOD 812
FLOOD 703

Next we display a vertical stacked bar chart for the top 10 fatalaties with injuries reported. I chose to sort by fatlities instead of the combined counts because death is orders of magnitude worse than an injury (usually).

health_harm <- storm %>% group_by(EVTYPE) %>% summarize(fatalities = sum(FATALITIES),
                                                        injured = sum(INJURIES)) %>%
        arrange(desc(fatalities)) %>% top_n(10)
health_harm$EVTYPE <- factor(health_harm$EVTYPE)
health_harm <- gather(health_harm, FATYPE, measurement, fatalities:injured)
health_harm$EVTYPE <- factor(health_harm$EVTYPE, levels = fatalities$EVTYPE)
ggplot(health_harm, aes(x = EVTYPE, y = measurement, fill=FATYPE)) +
        geom_bar(stat="identity") + ylab("count") +
        xlab("event type") + scale_fill_discrete(name = "harmful result",
                                                 labels = c("death", "injury")) +
        ggtitle("Top 10 fatality event types since 2000 with injury counts") +
        theme(axis.text.x=element_text(angle = -90, hjust = 0))

Across the United States, the following types of weather events have the greatest negative monetary consequences (rank ordered). Results are shown first as tables for property and crop damage. It is interesting to note that property damage is much larger then crop damage and while most weather events, like floods, are highly damaging to property and crops, some others, like drought, are far more damaging to crops than property.

property <- storm %>% group_by(EVTYPE) %>% summarize(property_damage = sum(PROPDMG)) %>%
        arrange(desc(property_damage))
crops <- storm %>% group_by(EVTYPE) %>% summarize(crop_damage = sum(CROPDMG)) %>%
        arrange(desc(crop_damage))
kable(property %>% top_n(10), caption = "Top 10 property damaging weather events since 2000")
Top 10 property damaging weather events since 2000
EVTYPE property_damage
FLOOD 134700812080
HURRICANE (TYPHOON) 69305840000
STORM SURGE/TIDE 47812123000
TORNADO 19460679560
HAIL 11988835920
FLASH FLOOD 11876805110
WILDFIRE 7438540000
TROPICAL STORM 7194930550
HIGH WIND 7101755700
THUNDERSTORM WIND 3382654440
kable(crops %>% top_n(10), caption = "Top 10 crop damaging weather events since 2000")
Top 10 crop damaging weather events since 2000
EVTYPE crop_damage
DROUGHT 9135585000
FLOOD 4222234400
HURRICANE (TYPHOON) 2607872800
HAIL 1785086200
FROST/FREEZE 1094086000
FLASH FLOOD 904738500
HIGH WIND 765957350
EXCESSIVE HEAT 492402000
SEICHE 448980000
TROPICAL STORM 412311000

Next we display a stacked bar chart for the top ten monetary damaging weather events sorted by total monetary damage. Notice that the graph reinforces that property damage is much larger than crop damage. It also reinforces that droughts are more damaging to crops.

econ_harm <- storm %>% group_by(EVTYPE) %>% summarize(property_dmg = sum(PROPDMG),
                                                        crop_dmg = sum(CROPDMG),
                                                      total_dmg = sum(property_dmg + crop_dmg)) %>%
        arrange(desc(property_dmg)) %>% top_n(10)
econ_harm$EVTYPE <- factor(econ_harm$EVTYPE)
econ_harm <- gather(econ_harm, FATYPE, measurement, property_dmg:crop_dmg)
econ_harm$EVTYPE <- reorder(econ_harm$EVTYPE, desc(econ_harm$total_dmg))
#econ_harm$EVTYPE <- factor(econ_harm$EVTYPE)
ggplot(econ_harm, aes(x = EVTYPE, y = measurement/1e+09, fill=FATYPE)) +
        geom_bar(stat="identity") + ylab("damage in billions of $US") +
        xlab("event type") + scale_fill_discrete(name = "damage type",
                                                 labels = c("property", "crop")) +
        ggtitle("Top 10 monetary damaging event types since 2000") +
        theme(axis.text.x=element_text(angle = -90, hjust = 0))