In this report, we examine severe weather events from the NOAA Storm Database and determine the top ten events that cause fatalities, injuries, crop damage, and property damage since the year 2000.
We show this in tabular format first, then in an aggregated visual format – stacked bar charts.
The data file required some cleaning. The most cumbersome cleaning task was dealing with the event type data which had a multitude of coding issues. We used a fuzzy string matching approach described in the report.
First we download the original Storm Data (.csv.bz) file and read it into a variable storm.
Documentation is available from National Weather Service Storm Data Documentation and the NCDC Storm Events FAQ
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="stormdata.bz.csv",
method = "curl", mode = "wb")
storm <- read.csv("temp", stringsAsFactors = FALSE)
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. We will therefore retain only the data since 2000.
storm$BGN_DATE <- as.Date(storm$BGN_DATE, format = "%m/%d/%Y")
storm <- storm[storm$BGN_DATE >= "2000-01-01",]
In addition to EVTYPE, the relevant columns are:
INJURIES, FATALITIES for population healthPROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP for economic damagerel_cols <- c("EVTYPE", "INJURIES", "FATALITIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm <- storm[,rel_cols]
The numeric fields INJURIES and FATALITIES need no cleaning.
EVTYPE has some duplicate categories (e.g., THUNDERSTORM WINDS vs Thunderstorm Wind), odd or miscoded fields (e.g., Summary September 3) and a variety of other issues. There are 48 actual event types identified in the Storm Data Documentation. After date restriction, there were 195 unique event types in the data set. I use a fuzzy string matching approach available stringdist package. I manually enter the values to real_events – it is less painful for you to read it than it was for me to type them in. Through trial and error, a maximum string distance of 8 seems to give the best results.
note: this is an imperfect match but a best attempt given the time constraints
library(stringdist)
library(dplyr)
real_events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
"Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm",
"Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog",
"Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow",
"Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide",
"Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm",
"Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
storm$EVTYPE <- toupper(storm$EVTYPE)
storm_events <- unique(storm$EVTYPE)
closest_distance <- function(string, stringVector) {
stringVector[amatch(string, stringVector, maxDist=8)]
}
new_event <- {}
for(i in 1:length(storm_events)) {
new_event <- append(new_event,
closest_distance(storm_events[i], toupper(real_events)))
}
lookup <- data.frame(EVTYPE = storm_events, new = new_event, stringsAsFactors = FALSE)
storm <- inner_join(storm, lookup)
storm <- storm[,-1]
colnames(storm)[7] <- "EVTYPE"
storm <- storm[complete.cases(storm),]
PROPDMGEXP and CROPDMGEXP have multipliers that effect PROPDMG, CROPDMG. Given the reduced data set, the transformations are as follows:
"0" to numeric 0K to 1000M to 1e+06B to 1e+09damage_mult <- function(old) {
ifelse(old == "0", 0,
ifelse(old =="", 1,
ifelse(old == "K", 1000,
ifelse(old == "M", 1e+06,
ifelse(old == "B", 1e+09, NA)))))
}
storm$PROPDMG <- storm$PROPDMG * damage_mult(storm$PROPDMGEXP)
storm$CROPDMG <- storm$CROPDMG * damage_mult(storm$CROPDMGEXP)
Across the United States, the following types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health (rank ordered). Results are shown first as tables for fatalaties and injuries. Tornados are the worst event in causing both death and injury. It is interesting to note that some events, like flash floods, have a higher ratio of deaths to injuries.
library(tidyr)
library(ggplot2)
library(knitr)
fatalities <- storm %>% group_by(EVTYPE) %>% summarize(fatalities = sum(FATALITIES)) %>%
arrange(desc(fatalities))
injuries <- storm %>% group_by(EVTYPE) %>% summarize(injured = sum(INJURIES)) %>%
arrange(desc(injured))
kable(fatalities %>% top_n(10), caption = "Top 10 weather events causing fatalities since 2000")
| EVTYPE | fatalities |
|---|---|
| TORNADO | 1193 |
| EXCESSIVE HEAT | 1013 |
| FLASH FLOOD | 600 |
| LIGHTNING | 467 |
| RIP CURRENT | 462 |
| FLOOD | 323 |
| HIGH WIND | 252 |
| HEAT | 231 |
| AVALANCHE | 179 |
| EXTREME COLD/WIND CHILL | 132 |
| THUNDERSTORM WIND | 132 |
kable(injuries %>% top_n(10), caption = "Top 10 weather events causing injuries since 2000")
| EVTYPE | injured |
|---|---|
| TORNADO | 15213 |
| EXCESSIVE HEAT | 3708 |
| LIGHTNING | 2995 |
| HIGH WIND | 2433 |
| THUNDERSTORM WIND | 1400 |
| HURRICANE (TYPHOON) | 1275 |
| WILDFIRE | 1251 |
| HEAT | 1226 |
| FLASH FLOOD | 812 |
| FLOOD | 703 |
Next we display a vertical stacked bar chart for the top 10 fatalaties with injuries reported. I chose to sort by fatlities instead of the combined counts because death is orders of magnitude worse than an injury (usually).
health_harm <- storm %>% group_by(EVTYPE) %>% summarize(fatalities = sum(FATALITIES),
injured = sum(INJURIES)) %>%
arrange(desc(fatalities)) %>% top_n(10)
health_harm$EVTYPE <- factor(health_harm$EVTYPE)
health_harm <- gather(health_harm, FATYPE, measurement, fatalities:injured)
health_harm$EVTYPE <- factor(health_harm$EVTYPE, levels = fatalities$EVTYPE)
ggplot(health_harm, aes(x = EVTYPE, y = measurement, fill=FATYPE)) +
geom_bar(stat="identity") + ylab("count") +
xlab("event type") + scale_fill_discrete(name = "harmful result",
labels = c("death", "injury")) +
ggtitle("Top 10 fatality event types since 2000 with injury counts") +
theme(axis.text.x=element_text(angle = -90, hjust = 0))
Across the United States, the following types of weather events have the greatest negative monetary consequences (rank ordered). Results are shown first as tables for property and crop damage. It is interesting to note that property damage is much larger then crop damage and while most weather events, like floods, are highly damaging to property and crops, some others, like drought, are far more damaging to crops than property.
property <- storm %>% group_by(EVTYPE) %>% summarize(property_damage = sum(PROPDMG)) %>%
arrange(desc(property_damage))
crops <- storm %>% group_by(EVTYPE) %>% summarize(crop_damage = sum(CROPDMG)) %>%
arrange(desc(crop_damage))
kable(property %>% top_n(10), caption = "Top 10 property damaging weather events since 2000")
| EVTYPE | property_damage |
|---|---|
| FLOOD | 134700812080 |
| HURRICANE (TYPHOON) | 69305840000 |
| STORM SURGE/TIDE | 47812123000 |
| TORNADO | 19460679560 |
| HAIL | 11988835920 |
| FLASH FLOOD | 11876805110 |
| WILDFIRE | 7438540000 |
| TROPICAL STORM | 7194930550 |
| HIGH WIND | 7101755700 |
| THUNDERSTORM WIND | 3382654440 |
kable(crops %>% top_n(10), caption = "Top 10 crop damaging weather events since 2000")
| EVTYPE | crop_damage |
|---|---|
| DROUGHT | 9135585000 |
| FLOOD | 4222234400 |
| HURRICANE (TYPHOON) | 2607872800 |
| HAIL | 1785086200 |
| FROST/FREEZE | 1094086000 |
| FLASH FLOOD | 904738500 |
| HIGH WIND | 765957350 |
| EXCESSIVE HEAT | 492402000 |
| SEICHE | 448980000 |
| TROPICAL STORM | 412311000 |
Next we display a stacked bar chart for the top ten monetary damaging weather events sorted by total monetary damage. Notice that the graph reinforces that property damage is much larger than crop damage. It also reinforces that droughts are more damaging to crops.
econ_harm <- storm %>% group_by(EVTYPE) %>% summarize(property_dmg = sum(PROPDMG),
crop_dmg = sum(CROPDMG),
total_dmg = sum(property_dmg + crop_dmg)) %>%
arrange(desc(property_dmg)) %>% top_n(10)
econ_harm$EVTYPE <- factor(econ_harm$EVTYPE)
econ_harm <- gather(econ_harm, FATYPE, measurement, property_dmg:crop_dmg)
econ_harm$EVTYPE <- reorder(econ_harm$EVTYPE, desc(econ_harm$total_dmg))
#econ_harm$EVTYPE <- factor(econ_harm$EVTYPE)
ggplot(econ_harm, aes(x = EVTYPE, y = measurement/1e+09, fill=FATYPE)) +
geom_bar(stat="identity") + ylab("damage in billions of $US") +
xlab("event type") + scale_fill_discrete(name = "damage type",
labels = c("property", "crop")) +
ggtitle("Top 10 monetary damaging event types since 2000") +
theme(axis.text.x=element_text(angle = -90, hjust = 0))