Executive Summary: Using NOAA-provided data spanning 1996 to 2011, this report identifies weather-related events that cause (1) the greatest number of human casualties and (2) the greatest amount of economic damage. Tornado/Storm events account for the highest number of casualties (27,731). Heat-related fatality is high: Though heat-related injuries (7,615) are less than one-third of tornado/storm injuries (25,558), fatality totals are similar (2,036 versus 2,173, respectively). Rip currents have the highest fatality rate (51.87%). Flood events are responsible for the greatest amount of economic damage ($165.71 billion), followed by severe tropical events ($95.78 billion) and wave/tidal events ($48.14 billion). The two highest casualty-inducing events and the four most economically damaging events involve water. These exploratory findings are used to suggest areas worthy of further research in which actionable steps could be taken.

Data Processing


All data processing code subsequently displayed is written with the goal(s) of identifying which weather-related events (1) cause most harm to human health and/or (2) cause the greatest amount of economic damage.

Data File Overview

Data used for this report were accessed from the US National Oceanic and Atmospheric Administration (NOAA) storm database. The dataset includes information on severe weather events and corresponding damage (including health and economic consequences). The raw file can be downloaded from the URL used in the download step below; documentation relating to the file is available separately.

Load Supplemental Packages

To begin, load the following packages/libraries:

library("data.table")
library("R.utils")
library("reshape2")
library("ggplot2")
library("dplyr")


Establish Project Directory, Download & Read Data

The code below establishes a project directory (if one hasn’t been established already) and makes it the working directory:

#set up project directory if one doesn't exist yet
mainDir <- "C:\\Users\\Bob\\Documents\\R\\05 Reproducible Research"
subDir <- "NOAA Storm Project"

#create the directory only when it is missing, then switch to it
if(!dir.exists(file.path(mainDir, subDir))){
  dir.create(file.path(mainDir, subDir), showWarnings = FALSE)
}
setwd(file.path(mainDir, subDir))

#download data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "./data.csv.bz2", mode = "wb")  #binary mode keeps the bz2 archive intact on Windows
file <- "./data.csv.bz2"
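
Because the archive is large, it can be convenient to skip the download when the file is already present. The following is a minimal sketch only: fetch_once is a hypothetical helper name, not part of the original analysis, and its fetch argument exists purely so the function can be exercised without a network connection.

```r
# Hypothetical helper (not in the original report): download a file only
# if it is not already on disk. `fetch` defaults to base R's download.file.
fetch_once <- function(url, dest, fetch = download.file) {
  if (!file.exists(dest)) {
    # mode = "wb" requests a binary transfer, which matters for archives
    fetch(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Intended use with the objects defined above (left commented out here):
# fetch_once(url, "./data.csv.bz2")
```

Re-running the report then reuses the local copy instead of fetching the file again.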


The bunzip2 function from the R.utils package decompresses the bz2 file into a csv file:

bunzip2("./data.csv.bz2", "./data.csv", remove = FALSE, overwrite = TRUE)


The file is read into the object ‘data’ using the fread function from the data.table package. The raw file is quite large - approximately 0.5GB. By loading only the eight columns relevant to this report, the loaded object is reduced to approximately 59MB, significantly reducing loading time:

data <- fread("./data.csv", 
              header = TRUE,
              stringsAsFactors = FALSE,
              select = c(2, 8, 23:28))
## Read 902297 rows and 8 (of 37) columns from 0.523 GB file in 00:00:10
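
Selecting columns by position is compact but breaks silently if the column order ever changes. As a hedged alternative, fread also accepts column names in select. The sketch below uses a small made-up file rather than the storm data; the names BGN_DATE, EVTYPE and FATALITIES are among the variables used later in this report, while STATE__ and REMARKS stand in for columns that are not needed.

```r
library(data.table)

# Toy file standing in for the storm data (values are invented for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("STATE__,BGN_DATE,EVTYPE,FATALITIES,REMARKS",
             "1,4/18/1950 0:00:00,TORNADO,0,none",
             "1,1/6/1996 0:00:00,WINTER STORM,1,none"),
           tmp)

# Select by name instead of by position
wanted <- c("BGN_DATE", "EVTYPE", "FATALITIES")
toy <- fread(tmp, select = wanted)

names(toy)
```

For the real file, the same idea would list all eight variable names used in this report in place of c(2, 8, 23:28); that those names sit at exactly those positions is taken from the call above rather than verified here.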


Initial Data Transformation

The following code block uses the mutate and filter functions from the dplyr package to (1) parse BGN_DATE into a DATE variable, (2) standardize the EVTYPE variable into all capital letters, (3) filter out events with no recorded fatalities, injuries, property damage or crop damage and (4) discard observations occurring in or before 1995.

data <- mutate(data, DATE = as.Date(BGN_DATE, format = '%m/%d/%Y %H:%M:%S')) %>% 
        mutate(EVTYPE = toupper(EVTYPE)) %>% 
        filter(FATALITIES !=0 | INJURIES !=0 | PROPDMG !=0 | CROPDMG !=0) %>% 
        filter(year(DATE) > 1995)
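
To make the date handling concrete, here is a small base R sketch of the same parse-then-filter idea on three invented date strings. In the pipeline above the year is extracted with year(), which the data.table package provides; the sketch uses format() instead so that it runs without any packages.

```r
# Invented examples in the same "month/day/year hour:min:sec" layout
bgn <- c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "12/31/1995 0:00:00")

# Parse to Date; the time-of-day part is read and then discarded
d <- as.Date(bgn, format = "%m/%d/%Y %H:%M:%S")

# Extract the calendar year and keep only observations after 1995
yr <- as.integer(format(d, "%Y"))
keep <- yr > 1995

data.frame(bgn, year = yr, keep)
```

Only the 1996 record survives the filter, which matches the report's 1996-2011 window.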


A Note on the Event Type Variable (EVTYPE)

Page six of the corresponding documentation file lists 48 distinct EVTYPE categories. However, the actual number of “unique” EVTYPE categories within the raw file is 222 (reduced to 186 after applying the above code block). Several reasons for the excess, which are addressed by the above code block, include:
   -Inconsistent abbreviation schemes (e.g., “URBAN/SML STREAM FLD”)
   -Events logged using both upper and lower-case (e.g., Light snow and Light Snow)

Further problems relating to an excessive number of EVTYPE categories must still be addressed, including:
   -Gradients of the same event (e.g., “WIND”, “HIGH WIND”, “STRONG WIND”)
   -Plural and singular description of same event (e.g., “MUDSLIDE” and “MUDSLIDES”)
   -Non-mutual exclusivity (e.g., “TYPHOON” and “HURRICANE/TYPHOON”)
   -Overly-granular descriptors of an event (e.g., “HIGH SEAS” vs. “ROUGH SEAS”)
   -Other unnecessary distinctions (e.g., separating “HEAVY FOG” from “FOG”)

While some of these granular distinctions may be helpful for a detailed analysis of a specific type of weather-related event, failure to adjust EVTYPE could lead to misleading results in this report.

Event Type Reclassification

To make EVTYPE more conducive to this analysis, we apply grep, a pattern-matching function that returns the positions of the character strings matching a regular expression; the matched rows are then overwritten with a consolidated category label. The following 14 commands recategorize 186 unique EVTYPE values to 21 unique values:

data[grep("WIND", data$EVTYPE), "EVTYPE"] <- "WIND EVENT"  #145
data[grep("TIDE|TIDAL|TSUNAMI|SEICHE|WAVE|SURGE|SEAS|SURF|SWELL", data$EVTYPE), "EVTYPE"] <- "WAVE/TIDAL"
data[grep("TYPHOON|TROPICAL|HURRICANE|COASTAL|MARINE", data$EVTYPE), "EVTYPE"] <- "SEVERE TROPICAL"
data[grep("SNOW|ICE|WINTRY|MIXED|ICY|WINTE|BLIZZ|GLAZE", data$EVTYPE), "EVTYPE"] <- "SNOW/ICE"
data[grep("FROST|FREEZ|COLD|CHILL|THERMIA", data$EVTYPE), "EVTYPE"] <- "COLD WEATHER/FROST"
data[grep("MUD|LANDSLUMP|SLIDE|EROSI", data$EVTYPE), "EVTYPE"] <- "LANDSLIDE"
data[grep("TORNAD|FUNN|WHIRL|THUND|MICROB|HAIL|LIGHTN", data$EVTYPE), "EVTYPE"] <- "TORNADO/STORM"
data[grep("FIRE|SMOKE|VOLCANIC", data$EVTYPE), "EVTYPE"] <- "FIRE/SMOKE/ASH"
data[grep("HEAT|WARM", data$EVTYPE), "EVTYPE"] <- "HEAT-RELATED"
data[grep("RAIN", data$EVTYPE), "EVTYPE"] <- "RAIN"
data[grep("FOG", data$EVTYPE), "EVTYPE"] <- "FOG"
data[grep("CURRENT", data$EVTYPE), "EVTYPE"] <- "RIP CURRENTS"
data[grep("DUST|DEVIL", data$EVTYPE), "EVTYPE"] <- "DUST-RELATED"
data[grep("FLOOD|WATER|DROWN|STREAM", data$EVTYPE), "EVTYPE"] <- "FLOOD"
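
Because each command overwrites the label it matches, the order of the commands matters: once a value has been relabelled, later patterns are tested against the new label, not the original one. The sketch below replays five of the rules above, in the same order, on a few illustrative labels (the sample labels are chosen for illustration and are not counts from the data).

```r
# Illustrative labels only; not an extract of the storm data
original <- c("THUNDERSTORM WIND", "HIGH WIND", "TORNADO", "HEAT WAVE", "FLASH FLOOD")
ev <- original

# Same patterns and same order as the corresponding commands above
ev[grep("WIND", ev)] <- "WIND EVENT"
ev[grep("TIDE|TIDAL|TSUNAMI|SEICHE|WAVE|SURGE|SEAS|SURF|SWELL", ev)] <- "WAVE/TIDAL"
ev[grep("TORNAD|FUNN|WHIRL|THUND|MICROB|HAIL|LIGHTN", ev)] <- "TORNADO/STORM"
ev[grep("HEAT|WARM", ev)] <- "HEAT-RELATED"
ev[grep("FLOOD|WATER|DROWN|STREAM", ev)] <- "FLOOD"

# Side-by-side view of original and reclassified labels
data.frame(original, reclassified = ev)
```

In this sketch a label containing both WIND and THUND is claimed by the earlier WIND rule, and a label such as HEAT WAVE is claimed by the WAVE pattern before the HEAT rule runs. Whether such cases are material in the real data would need to be checked against the actual EVTYPE values.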


Data Processing: Human Casualty

The following steps process data so that the human health question can be explored more clearly.

To create a single measure in which damage to human health can be assessed, the “INJURIES” and “FATALITIES” are added together to create “CASUALTIES”. However, as some would consider death a significantly worse outcome than injury, it is probably worthwhile to allow INJURIES and FATALITIES to be distinguished. Thus, the code below is oriented to show harm to human health on a combined basis (CASUALTIES), as well as on an individual basis (INJURIES and FATALITIES).

data$CASUALTIES <- data$INJURIES + data$FATALITIES


Next, the object ‘harm’ is established using the group_by function of the dplyr package.

harm <- group_by(data, EVTYPE)

The code below will tell us how many casualties were caused by each event type:

harm <- summarise(harm,
                  FATALITIES = sum(FATALITIES, na.rm = TRUE),
                  INJURIES = sum(INJURIES, na.rm = TRUE),
                  CASUALTIES = sum(CASUALTIES, na.rm = TRUE))

harm <- arrange(harm, desc(CASUALTIES))

topTen <- select(harm, -CASUALTIES)
topTen$EVTYPE <- reorder(topTen$EVTYPE, (topTen$INJURIES+topTen$FATALITIES))
topTen <- topTen[1:10,]


Finally, the code below creates the graph depicting the 10 events most harmful to human health, as measured by casualties (shown subsequently in the Results section). The graph also distinguishes fatalities from injuries. The code block uses both the melt and ggplot functions (from the reshape2 and ggplot2 packages, respectively).

topTen.m <- melt(topTen, id.vars = c("EVTYPE"),
                          measure.vars = c("FATALITIES", "INJURIES"),
                          variable.name = "TYPE",
                          value.name = "HUMAN_CASUALTIES")

g <- ggplot(topTen.m, aes(x=EVTYPE, y=HUMAN_CASUALTIES, fill=TYPE)) +
            geom_bar(stat="identity") +
            labs(title = "Casualties by Weather-related Event, 1996-2011",
                 x = "",
                 y = "Human Casualties") +
            theme(axis.text.x = element_text(angle = 45, hjust = 1),
                  axis.text.y = element_text(size = 10),
                  plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
                  legend.title = element_blank(),
                  #legend.justification = c(0.1,0.9),
                  legend.position = c(0.15,0.80),
                  legend.text = element_text(size=8),
                  legend.key.width = unit(0.6, "line"),
                  legend.key.height = unit(0.2, "line"))
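
The melt step above converts the table from wide form (one column per casualty type) to long form (one row per event and casualty type), which is the layout ggplot needs in order to stack the bars by TYPE. A minimal sketch on two rows, using figures from this report's results and assuming the reshape2 package is installed:

```r
library(reshape2)

# Two event types in wide form (figures as reported in the Results section)
wide <- data.frame(EVTYPE = c("FLOOD", "HEAT-RELATED"),
                   FATALITIES = c(1337, 2036),
                   INJURIES = c(8514, 7615))

# One row per event/type pair; the former column names become the TYPE variable
long <- melt(wide, id.vars = "EVTYPE",
             measure.vars = c("FATALITIES", "INJURIES"),
             variable.name = "TYPE",
             value.name = "HUMAN_CASUALTIES")
long
```

Two events with two measures each yield four rows, and the totals are unchanged by the reshaping.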



Data Processing: Economic Consequence

The following steps process data so that the economic consequence question can be explored more clearly.

Specifically, we want to explore the relationship between weather-related events and economic damage. To achieve this, we need to combine data dispersed across 4 variables - PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP - into a comprehensive damage variable we’ll call “TOTALDMG”.

We begin by identifying all the values we’ll need to transform in both the PROPDMGEXP and CROPDMGEXP variables:

unique(data$PROPDMGEXP)
## [1] "K" ""  "M" "B"
unique(data$CROPDMGEXP)
## [1] "K" ""  "M" "B"


The set of unique values for both PROPDMGEXP and CROPDMGEXP happens to be identical: “K”, “”, “M” and “B”. An extra step is needed to transform the blank “” values, which we’ll replace with the placeholder “A” before continuing.

data$PROPDMGEXP <- sub("^$", "A", data$PROPDMGEXP)
data$CROPDMGEXP <- sub("^$", "A", data$CROPDMGEXP)


The next step calculates both property and crop damage, which involves a simple multiplication of the ‘face’ value by the ‘magnitude/exponent’ value. But in order to perform this multiplication, the letter values in the exponent variables (PROPDMGEXP and CROPDMGEXP) must be recoded into their respective numerical values:

data$PROPDMG <- data$PROPDMG * as.numeric(recode(data$PROPDMGEXP,
                                                 "A"="1",
                                                 "K"="1000",
                                                 "M"="1000000",
                                                 "B"="1000000000"))


data$CROPDMG <- data$CROPDMG * as.numeric(recode(data$CROPDMGEXP,
                                                 "A"="1",
                                                 "K"="1000",
                                                 "M"="1000000",
                                                 "B"="1000000000"))


Property and crop damage values are added together to create the total value variable (TOTALDMG).

data$TOTALDMG <- data$PROPDMG + data$CROPDMG
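
The letter-to-multiplier conversion above can also be written as a lookup in a named numeric vector. This is an alternative sketch rather than the method used in this report; the name multiplier and the sample values are illustrative.

```r
# Named lookup: exponent code -> numeric multiplier ("A" is the blank placeholder)
multiplier <- c(A = 1, K = 1e3, M = 1e6, B = 1e9)

# Illustrative face values and exponent codes
dmg     <- c(25, 2.5, 0, 1.2)
dmg_exp <- c("K", "M", "A", "B")

# Index the lookup by code, drop the names, and multiply element-wise
dollars <- dmg * unname(multiplier[dmg_exp])
dollars

# An empty string never matches a name in R, so a blank code would give NA;
# this is one reason the blank-to-"A" replacement remains useful here
is.na(unname(multiplier[""]))
```

A code missing from the lookup likewise produces NA, so unexpected values surface in the result instead of passing through unnoticed.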


Steps to create a graph depicting the 10 events causing most economic damage are nearly identical to the previously-described steps in creating the ‘harm to human health’ graph. One exception is that economic damage is not delineated into separate (i.e., crop and property) categories. Total figures are adjusted to be expressed in $billions.

#establish economic consequence obj - "econ" -beginning by grouping by event type
econ <- group_by(data, EVTYPE)

#code below will tell us total damage caused by each event type                                   
econ <- summarise(econ,
                  TOTALDMG = sum(TOTALDMG, na.rm = TRUE))

econ <- arrange(econ, desc(TOTALDMG))

topTenE <- econ
topTenE$TOTALDMG <- topTenE$TOTALDMG/1000000000
topTenE$EVTYPE <- reorder(topTenE$EVTYPE, topTenE$TOTALDMG)
topTenE <- topTenE[1:10,]

topTenE.m <- melt(topTenE, id.vars = c("EVTYPE"),
                 measure.vars = c("TOTALDMG"),
                 variable.name = "TYPE",
                 value.name = "ECON_DAMAGE")

e <- ggplot(topTenE.m, aes(x=EVTYPE, y=ECON_DAMAGE, fill=TYPE)) +
            geom_bar(stat="identity") +
            labs(title = "Economic Damage by Weather-related Event, 1996-2011",
                 x = "",
                 y = "Economic Damage, $USD billions") +
            theme(axis.text.x = element_text(angle = 45, hjust = 1),
                  axis.text.y = element_text(size = 10),
                  plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
                  legend.position = "none")


Results


This section summarizes findings related to both (1) effect on human health and (2) cause of economic consequences.

Effect on Human Health


Across the United States, which types of events are the most harmful with respect to population health?
Tornado/Storm events account for the highest number of casualties (27,731), followed by flood events (9,851). Heat-related fatality is high: Though heat-related injuries (7,615) are less than one-third of tornado/storm injuries (25,558), fatality totals are similar (2,036 versus 2,173, respectively). Rip currents have the highest fatality rate (51.87%).

#Five most harmful events to human health, by casualties:
head(harm, 5)
## # A tibble: 5 × 4
##          EVTYPE FATALITIES INJURIES CASUALTIES
##           <chr>      <dbl>    <dbl>      <dbl>
## 1 TORNADO/STORM       2173    25558      27731
## 2         FLOOD       1337     8514       9851
## 3  HEAT-RELATED       2036     7615       9651
## 4    WIND EVENT       1021     6712       7733
## 5      SNOW/ICE        548     3595       4143
#Graphically depict results using chart "g" constructed in previous section
g
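
The fatality rate quoted above is not computed in the displayed code. A minimal sketch, assuming the rate is defined as fatalities divided by total casualties (the report does not state the formula, so this definition is an assumption), applied to the five rows printed above:

```r
# The five rows shown in the head(harm, 5) output above
top5 <- data.frame(
  EVTYPE     = c("TORNADO/STORM", "FLOOD", "HEAT-RELATED", "WIND EVENT", "SNOW/ICE"),
  FATALITIES = c(2173, 1337, 2036, 1021, 548),
  CASUALTIES = c(27731, 9851, 9651, 7733, 4143)
)

# Assumed definition: share of casualties that were fatal, in percent
top5$FATALITY_RATE <- round(100 * top5$FATALITIES / top5$CASUALTIES, 2)

# Sort from highest to lowest rate
top5[order(-top5$FATALITY_RATE), ]
```

Among these five event types, heat-related events have the highest rate under this definition. Applying the same calculation to the full harm table would be needed to reproduce the rip current figure.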

Causes of Economic Consequences


Across the United States, which types of events have the greatest economic consequences?
Flood events are responsible for the greatest amount of economic damage ($165.71 billion), followed by severe tropical events ($95.78 billion) and wave/tidal events ($48.14 billion).

#Five most economically harmful events, by total damage ($USD):
head(econ, 5)
## # A tibble: 5 × 2
##            EVTYPE     TOTALDMG
##             <chr>        <dbl>
## 1           FLOOD 165713182510
## 2 SEVERE TROPICAL  95782029920
## 3      WAVE/TIDAL  48141126000
## 4   TORNADO/STORM  42744318810
## 5      WIND EVENT  15096615120
#Graphically depict results using chart "e" constructed in previous section
e


Looking Ahead / Areas of Further Research


Though this report is exploratory in nature, these initial findings point to areas worthy of future study.

Any actionable item must be considered in the context of cost (both money and manpower) and regulatory constraints (e.g., jurisdictions, eminent domain laws, etc.). Further research may identify low-cost yet effective initiatives that reduce casualties, reduce economic damage, or both.

Water/Flood - Flooding is the single greatest cause of economic damage. Can we identify systemic problems relating to municipal zoning standards? Can stilts and related building techniques play a role in tropical storm-prone areas (e.g., Coastal Carolina, Florida)? Without encroaching on property rights, can we dissuade people from building in flood-prone areas? Where can engineered solutions (e.g., dams, levees, retention walls) play a greater role in flood control while remaining economical and safe?

Rip Currents - Rip currents have the sixth-highest death total (542) and the highest fatality rate (51.87%). As both the mechanics and high-risk areas of rip currents are well-known, it may be worthwhile to examine the signage at specific high-risk locations.

Heat-Related - The elderly are most susceptible to heat-related casualty. Relatedly, what warning systems or programs, if any, are in place where heat-related casualties tend to occur? Can we assess the efficacy of programs already in place? It may be worth exploring the use of PSAs aired in places facing an imminent heat wave.

Snow/Ice - An inch of snow will likely do more harm to Dallas or Atlanta than it would to Chicago or Minneapolis. What is the historical frequency and magnitude of snow/ice events in southern and mid-southern metro areas? What are the costs of acquiring snow/ice preparedness infrastructure (e.g., plows, salt, rental space, et cetera)? How can we use this information to help municipalities in need?

Wave/Tidal - Both earthquake-generated and landslide-generated tsunamis are nearly impossible to accurately forecast. Tidal events, however, often occur with regularity, and in some cases, at near-certain intervals. How many deaths and/or injuries occurred at places with known tidal phenomena? Are warning signs clearly placed at these locations? Did these deaths/injuries occur at night, when warning signs might not have been seen? Which of these areas (if any) lack signage?