Executive Summary: Using NOAA-provided data spanning 1996-2011, this report identifies the weather-related events that cause (1) the greatest number of human casualties and (2) the greatest amount of economic damage. Tornado/Storm events account for the highest number of casualties (27,731). Heat-related fatality is high: though heat-related injuries (7,615) are less than one-third of tornado/storm injuries (25,558), fatality totals are similar (2,036 versus 2,173, respectively). Rip currents have the highest fatality rate (51.87%). Flood events are responsible for the greatest amount of economic damage ($165.71 billion), followed by severe tropical events ($95.78 billion) and wave/tidal events ($48.14 billion). The two highest casualty-inducing events and four highest property damage-causing events involve water. These exploratory findings are used to suggest areas for further research that could lead to actionable steps.
All data processing code subsequently displayed is written with the goal(s) of identifying which weather-related events (1) cause most harm to human health and/or (2) cause the greatest amount of economic damage.
Data used for this report were accessed from the US National Oceanic and Atmospheric Administration (NOAA) storm database. The dataset includes information on severe weather events and corresponding damage (including health and economic consequences). The raw file can be downloaded here, and documentation relating to the file can be downloaded here.
To begin, load the following packages/libraries:
library("data.table")
library("R.utils")
library("reshape2")
library("ggplot2")
library("dplyr")
The code below establishes a project folder (if one hasn’t been established already):
#set up project folder if it doesn't exist yet
mainDir <- "C:\\Users\\Bob\\Documents\\R\\05 Reproducible Research"
subDir <- "NOAA Storm Project"
#create the folder only when it is missing, then make it the working directory
if(!dir.exists(file.path(mainDir, subDir))){
dir.create(file.path(mainDir, subDir), recursive = TRUE, showWarnings = FALSE)
}
setwd(file.path(mainDir, subDir))
#download data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "./data.csv.bz2")
file <- "./data.csv.bz2"
The bunzip2 function from the R.utils package decompresses the bz2 file into a csv file:
bunzip2("./data.csv.bz2", "./data.csv", remove = FALSE, overwrite = TRUE)
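As an aside, base R can also read a bz2-compressed CSV without a separate decompression step, because read.csv accepts a bzfile connection. The sketch below is only an illustration on a small made-up file written to a temporary location; the object names (toy, toy_path, toy_read) are hypothetical, and this route is likely slower than fread on a file as large as the storm data.

```r
# Toy data standing in for the storm file (values are made up).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bz2-compressed CSV in a temporary location.
toy_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back directly from the compressed file.
toy_read <- read.csv(bzfile(toy_path), stringsAsFactors = FALSE)
print(toy_read)
```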
The file is read into the object ‘data’ using the fread function from the data.table package. The raw file is quite large, approximately 0.5GB. Loading only the eight columns relevant to this report reduces the loaded size to approximately 59MB and significantly reduces loading time:
data <- fread("./data.csv",
header = TRUE,
stringsAsFactors = FALSE,
select = c(2, 8, 23:28))
## Read 902297 rows and 8 (of 37) columns from 0.523 GB file in 00:00:10
The following code block uses the mutate and filter functions from the dplyr package to (1) standardize the EVTYPE variable into all capital letters, (2) filter out events that caused no fatalities, injuries, property damage or crop damage and (3) discard observations occurring in or before 1995.
data <- mutate(data, DATE = as.Date(BGN_DATE, format = '%m/%d/%Y %H:%M:%S')) %>%
mutate(EVTYPE = toupper(EVTYPE)) %>%
filter(FATALITIES !=0 | INJURIES !=0 | PROPDMG !=0 | CROPDMG !=0) %>%
filter(year(DATE) > 1995)
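To make the effect of these filters concrete, here is a minimal base R sketch on three made-up rows. It mirrors the same steps (parse the date, upper-case EVTYPE, keep rows with some impact, keep years after 1995) without dplyr; the toy values and the helper names has_impact and after_1995 are assumptions for illustration only.

```r
# Toy rows in the same shape as the storm data (values are made up).
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "7/4/2005 0:00:00"),
  EVTYPE     = c("Tornado", "Winter Storm", "heat"),
  FATALITIES = c(0, 1, 0),
  INJURIES   = c(15, 0, 0),
  PROPDMG    = c(25, 0, 0),
  CROPDMG    = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Same date format string as the report's code.
toy$DATE   <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
toy$EVTYPE <- toupper(toy$EVTYPE)

# A row is kept only if it records some harm or damage and falls after 1995.
has_impact <- toy$FATALITIES != 0 | toy$INJURIES != 0 |
  toy$PROPDMG != 0 | toy$CROPDMG != 0
after_1995 <- as.integer(format(toy$DATE, "%Y")) > 1995

kept <- toy[has_impact & after_1995, ]
print(kept$EVTYPE)
```

In this toy set, the 1950 row is dropped by the year filter and the 2005 row is dropped because it records no harm or damage, so only the 1996 row survives.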
Page six of the corresponding documentation file lists 48 distinct EVTYPE categories. However, the actual number of “unique” EVTYPE categories within the raw file is 222 (reduced to 186 after applying the above code block). Reasons for the excess that the above code block addresses include:
-Inconsistent abbreviation schemes (e.g., “URBAN/SML STREAM FLD”)
-Events logged using both upper and lower-case (e.g., Light snow and Light Snow)
Further problems relating to an excessive number of EVTYPE categories must still be addressed, including:
-Gradients of the same event (e.g., “WIND”, “HIGH WIND”, “STRONG WIND”)
-Plural and singular description of same event (e.g., “MUDSLIDE” and “MUDSLIDES”)
-Non-mutual exclusivity (e.g., “TYPHOON” and “HURRICANE/TYPHOON”)
-Overly-granular descriptors of an event (e.g., “HIGH SEAS” vs. “ROUGH SEAS”)
-Other unnecessary distinctions (e.g., separating “HEAVY FOG” from “FOG”)
While some of these granular distinctions may be helpful for a detailed analysis of a specific type of weather-related event, failure to adjust EVTYPE could lead to misleading results in this report.
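A small sketch shows why upper-casing alone does not resolve the excess categories. The vector below is made up from examples named in the list above; the object names are hypothetical.

```r
# Made-up labels illustrating case duplicates and singular/plural duplicates.
labels <- c("Light snow", "Light Snow", "LIGHT SNOW", "MUDSLIDE", "MUDSLIDES")

n_raw   <- length(unique(labels))           # every spelling counts separately
n_upper <- length(unique(toupper(labels)))  # case duplicates collapse

print(c(raw = n_raw, upper = n_upper))
print(unique(toupper(labels)))
```

Case folding merges the three spellings of light snow, but "MUDSLIDE" and "MUDSLIDES" remain distinct, which is why a pattern-based recategorization is still needed.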
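To make EVTYPE more conducive to this analysis, we apply grep, a pattern-matching function that returns the positions of the character strings matching a regular expression; assigning a new label to those positions recategorizes the matching rows. The commands run in order, so a value matched by an earlier pattern is relabeled before later patterns are tested. The following 14 commands recategorize 186 unique EVTYPE values to 21 unique values: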
data[grep("WIND", data$EVTYPE), "EVTYPE"] <- "WIND EVENT" #145
data[grep("TIDE|TIDAL|TSUNAMI|SEICHE|WAVE|SURGE|SEAS|SURF|SWELL", data$EVTYPE), "EVTYPE"] <- "WAVE/TIDAL"
data[grep("TYPHOON|TROPICAL|HURRICANE|COASTAL|MARINE", data$EVTYPE), "EVTYPE"] <- "SEVERE TROPICAL"
data[grep("SNOW|ICE|WINTRY|MIXED|ICY|WINTE|BLIZZ|GLAZE", data$EVTYPE), "EVTYPE"] <- "SNOW/ICE"
data[grep("FROST|FREEZ|COLD|CHILL|THERMIA", data$EVTYPE), "EVTYPE"] <- "COLD WEATHER/FROST"
data[grep("MUD|LANDSLUMP|SLIDE|EROSI", data$EVTYPE), "EVTYPE"] <- "LANDSLIDE"
data[grep("TORNAD|FUNN|WHIRL|THUND|MICROB|HAIL|LIGHTN", data$EVTYPE), "EVTYPE"] <- "TORNADO/STORM"
data[grep("FIRE|SMOKE|VOLCANIC", data$EVTYPE), "EVTYPE"] <- "FIRE/SMOKE/ASH"
data[grep("HEAT|WARM", data$EVTYPE), "EVTYPE"] <- "HEAT-RELATED"
data[grep("RAIN", data$EVTYPE), "EVTYPE"] <- "RAIN"
data[grep("FOG", data$EVTYPE), "EVTYPE"] <- "FOG"
data[grep("CURRENT", data$EVTYPE), "EVTYPE"] <- "RIP CURRENTS"
data[grep("DUST|DEVIL", data$EVTYPE), "EVTYPE"] <- "DUST-RELATED"
data[grep("FLOOD|WATER|DROWN|STREAM", data$EVTYPE), "EVTYPE"] <- "FLOOD"
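Because the commands above run sequentially, the order of the patterns affects the outcome. The sketch below applies four of the report's patterns, in the report's order, to a made-up character vector (not the storm data) to show both what grep returns and how an earlier rule can claim a value before a later one sees it. The vector evtype and its contents are assumptions for illustration.

```r
# Made-up event labels (already upper-cased).
evtype <- c("HIGH WIND", "THUNDERSTORM WIND", "HEAT WAVE", "EXCESSIVE HEAT")

# grep() returns the positions of matching elements; it does not modify them.
idx <- grep("WIND", evtype)
print(idx)

# The assignment performs the relabeling. Rules are applied in source order.
evtype[grep("WIND", evtype)] <- "WIND EVENT"
evtype[grep("TIDE|TIDAL|TSUNAMI|SEICHE|WAVE|SURGE|SEAS|SURF|SWELL", evtype)] <- "WAVE/TIDAL"
evtype[grep("TORNAD|FUNN|WHIRL|THUND|MICROB|HAIL|LIGHTN", evtype)] <- "TORNADO/STORM"
evtype[grep("HEAT|WARM", evtype)] <- "HEAT-RELATED"

print(evtype)
```

Two side effects are visible here: a label such as "THUNDERSTORM WIND" ends up as "WIND EVENT" because the WIND rule runs before the THUND rule, and a label such as "HEAT WAVE" ends up as "WAVE/TIDAL" because the WAVE rule runs before the HEAT rule. Whether that is the intended grouping is a judgment call; reordering the rules or using more specific patterns would change the result.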
The following steps process data so that the human health question can be explored more clearly.
To create a single measure by which damage to human health can be assessed, “INJURIES” and “FATALITIES” are added together to create “CASUALTIES”. However, as some would consider death a significantly worse outcome than injury, it is probably worthwhile to allow INJURIES and FATALITIES to be distinguished. Thus, the code below is oriented to show harm to human health on a combined basis (CASUALTIES), as well as on an individual basis (INJURIES and FATALITIES).
data$CASUALTIES <- data$INJURIES + data$FATALITIES
Next, the variable ‘harm’ is established using the group_by function of the dplyr package.
harm <- group_by(data, EVTYPE)
The code below will tell us how many casualties were caused by each event type:
harm <- summarise(harm,
FATALITIES = sum(FATALITIES, na.rm = TRUE),
INJURIES = sum(INJURIES, na.rm = TRUE),
CASUALTIES = sum(CASUALTIES, na.rm = TRUE))
harm <- arrange(harm, desc(CASUALTIES))
topTen <- select(harm, -CASUALTIES)
topTen$EVTYPE <- reorder(topTen$EVTYPE, (topTen$INJURIES+topTen$FATALITIES))
topTen <- topTen[1:10,]
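The grouping and ranking steps can be checked by hand on a tiny example. The sketch below uses base R's aggregate and order in place of the dplyr verbs, on made-up counts; the data frame toy and the result totals are hypothetical names.

```r
# Made-up per-event records.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "RAIN", "FOG"),
  FATALITIES = c(1, 0, 0, 2),
  INJURIES   = c(3, 4, 1, 0),
  stringsAsFactors = FALSE
)
toy$CASUALTIES <- toy$FATALITIES + toy$INJURIES

# Sum each measure within event type, then rank by combined casualties.
totals <- aggregate(cbind(FATALITIES, INJURIES, CASUALTIES) ~ EVTYPE,
                    data = toy, FUN = sum)
totals <- totals[order(-totals$CASUALTIES), ]
print(totals)
```

Here FLOOD ranks first on casualties even though FOG has more fatalities, which is the kind of difference the separate FATALITIES and INJURIES columns are kept to expose.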
Finally, the code below will create the graph depicting the 10 events most harmful to human health, as measured by casualties (shown subsequently in the Results section). Additionally, the final graph will delineate fatalities from injuries. The following code block utilizes both the melt and ggplot functions (from the reshape2 and ggplot2 packages, respectively).
topTen.m <- melt(topTen, id.vars = c("EVTYPE"),
measure.vars = c("FATALITIES", "INJURIES"),
variable.name = "TYPE",
value.name = "HUMAN_CASUALTIES")
g <- ggplot(topTen.m, aes(x=EVTYPE, y=HUMAN_CASUALTIES, fill=TYPE)) +
geom_bar(stat="identity") +
labs(title = "Casualties by Weather-related Event, 1996-2011",
x = "",
y = "Human Casualties") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
legend.title = element_blank(),
#legend.justification = c(0.1,0.9),
legend.position = c(0.15,0.80),
legend.text = element_text(size=8),
legend.key.width = unit(0.6, "line"),
legend.key.height = unit(0.2, "line"))
The following steps process data so that the economic consequence question can be explored more clearly.
Specifically, we want to explore the relationship between weather-related events and economic damage. To achieve this, we need to combine data dispersed across 4 variables - PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP - into a comprehensive damage variable we’ll call “TOTALDMG”.
We begin by identifying all the values we’ll need to transform in both the PROPDMGEXP and CROPDMGEXP variables:
unique(data$PROPDMGEXP)
## [1] "K" "" "M" "B"
unique(data$CROPDMGEXP)
## [1] "K" "" "M" "B"
The sets of unique values for PROPDMGEXP and CROPDMGEXP happen to be identical: “K”, “”, “M” and “B”. An extra step is needed to transform the blank “” values, which we’ll replace with placeholder “A” before continuing.
data$PROPDMGEXP <- sub("^$", "A", data$PROPDMGEXP)
data$CROPDMGEXP <- sub("^$", "A", data$CROPDMGEXP)
The next step calculates both property and crop damage, which involves a simple multiplication of the ‘face’ value by the ‘magnitude/exponent’ value. But in order to perform this multiplication, the letter values in the exponent variables (PROPDMGEXP and CROPDMGEXP) must be recoded into their respective numerical values:
data$PROPDMG <- data$PROPDMG * as.numeric(recode(data$PROPDMGEXP,
"A"="1",
"K"="1000",
"M"="1000000",
"B"="1000000000"))
data$CROPDMG <- data$CROPDMG * as.numeric(recode(data$CROPDMGEXP,
"A"="1",
"K"="1000",
"M"="1000000",
"B"="1000000000"))
Property and crop damage values are added together to create the total value variable (TOTALDMG).
data$TOTALDMG <- data$PROPDMG + data$CROPDMG
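The same multiplication can be sketched with a named numeric vector used as a lookup table, which avoids the round trip through character strings. This is an alternative illustration on made-up rows, not the report's method; the names multiplier, toy, prop, crop and total are hypothetical.

```r
# Lookup table from exponent code to numeric multiplier ("A" is the blank placeholder).
multiplier <- c("A" = 1, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Made-up damage records.
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 0),
  PROPDMGEXP = c("M", "K", ""),
  CROPDMG    = c(0, 500, 3),
  CROPDMGEXP = c("", "K", "B"),
  stringsAsFactors = FALSE
)

# Replace blanks with the placeholder, as in the report.
toy$PROPDMGEXP <- sub("^$", "A", toy$PROPDMGEXP)
toy$CROPDMGEXP <- sub("^$", "A", toy$CROPDMGEXP)

# Index the lookup table by code; an unexpected code would yield NA.
prop  <- toy$PROPDMG * unname(multiplier[toy$PROPDMGEXP])
crop  <- toy$CROPDMG * unname(multiplier[toy$CROPDMGEXP])
total <- prop + crop
print(total)
```

For the first toy row, 2.5 with code "M" becomes 2,500,000 in property damage and adds nothing for crops; for the second, 10 with "K" plus 500 with "K" gives 510,000.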
Steps to create a graph depicting the 10 events causing most economic damage are nearly identical to the previously-described steps in creating the ‘harm to human health’ graph. One exception is that economic damage is not delineated into separate (i.e., crop and property) categories. Total figures are adjusted to be expressed in $billions.
#establish economic consequence obj - "econ" -beginning by grouping by event type
econ <- group_by(data, EVTYPE)
#code below will tell us total damage caused by each event type
econ <- summarise(econ,
TOTALDMG = sum(TOTALDMG, na.rm = TRUE))
econ <- arrange(econ, desc(TOTALDMG))
topTenE <- econ
topTenE$TOTALDMG <- topTenE$TOTALDMG/1000000000
topTenE$EVTYPE <- reorder(topTenE$EVTYPE, topTenE$TOTALDMG)
topTenE <- topTenE[1:10,]
topTenE.m <- melt(topTenE, id.vars = c("EVTYPE"),
measure.vars = c("TOTALDMG"),
variable.name = "TYPE",
value.name = "ECON_DAMAGE")
e <- ggplot(topTenE.m, aes(x=EVTYPE, y=ECON_DAMAGE, fill=TYPE)) +
geom_bar(stat="identity") +
labs(title = "Economic Damage by Weather-related Event, 1996-2011",
x = "",
y = "Economic Damage, USD billions") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
legend.position = "none")
This section summarizes findings related to both (1) effect on human health and (2) cause of economic consequences.
Across the United States, which types of events are the most harmful with respect to population health?
Tornado/Storm events account for the highest number of casualties (27,731), followed by flood events (9,851). Heat-related fatality is high: Though heat-related injuries (7,615) are less than one-third of tornado/storm injuries (25,558), fatality totals are similar (2,036 versus 2,173, respectively). Rip currents have the highest fatality rate (51.87%).
#Five most harmful events to human health, by casualties:
head(harm, 5)
## # A tibble: 5 × 4
## EVTYPE FATALITIES INJURIES CASUALTIES
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO/STORM 2173 25558 27731
## 2 FLOOD 1337 8514 9851
## 3 HEAT-RELATED 2036 7615 9651
## 4 WIND EVENT 1021 6712 7733
## 5 SNOW/ICE 548 3595 4143
#Graphically depict results using chart "g" constructed in previous section
g
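The report does not show how the fatality rate is computed. A minimal sketch, assuming the rate is defined as fatalities divided by total casualties, can be applied to the five rows printed above. The definition and the object name top5 are assumptions; the counts are copied from the table.

```r
# Counts copied from the top-five table above.
top5 <- data.frame(
  EVTYPE     = c("TORNADO/STORM", "FLOOD", "HEAT-RELATED", "WIND EVENT", "SNOW/ICE"),
  FATALITIES = c(2173, 1337, 2036, 1021, 548),
  CASUALTIES = c(27731, 9851, 9651, 7733, 4143),
  stringsAsFactors = FALSE
)

# Assumed definition: share of casualties that were fatal, in percent.
top5$FATALITY_RATE <- round(100 * top5$FATALITIES / top5$CASUALTIES, 2)

print(top5[order(-top5$FATALITY_RATE), c("EVTYPE", "FATALITY_RATE")])
```

Under this definition, heat-related events have a markedly higher fatality rate than tornado/storm events among these five, which is consistent with the observation that their fatality totals are similar despite far fewer injuries.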
Across the United States, which types of events have the greatest economic consequences?
Flood events are responsible for the greatest amount of economic damage ($165.71 billion), followed by severe tropical events ($95.78 billion) and wave/tidal events ($48.14 billion).
#Five most economically harmful events, by total damage (USD):
head(econ, 5)
## # A tibble: 5 × 2
## EVTYPE TOTALDMG
## <chr> <dbl>
## 1 FLOOD 165713182510
## 2 SEVERE TROPICAL 95782029920
## 3 WAVE/TIDAL 48141126000
## 4 TORNADO/STORM 42744318810
## 5 WIND EVENT 15096615120
#Graphically depict results using chart "e" constructed in previous section
e
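The table above is printed in dollars, while the text quotes billions. The conversion can be checked with a short sketch; the vector name top5_damage is hypothetical and the values are copied from the table.

```r
# Totals in US dollars, copied from the table above.
top5_damage <- c(
  "FLOOD"           = 165713182510,
  "SEVERE TROPICAL" = 95782029920,
  "WAVE/TIDAL"      = 48141126000,
  "TORNADO/STORM"   = 42744318810,
  "WIND EVENT"      = 15096615120
)

# Express in billions of dollars, rounded to two decimals.
billions <- round(top5_damage / 1e9, 2)
print(billions)
```

The first three rounded values correspond to the $165.71 billion, $95.78 billion and $48.14 billion quoted in the text.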
Though this report is exploratory in nature, these initial findings point to areas worthy of future study.
Any actionable item must be considered in the context of cost (both money and manpower) and regulatory constraints (e.g., jurisdictions, eminent domain laws, etc.). Further research may identify low-cost yet effective initiatives that reduce casualties, reduce economic damage, or both.
Water/Flood - Flooding is the single greatest cause of economic damage. Can we identify systemic problems relating to municipal zoning standards? Can stilts and related building techniques play a role in tropical storm-prone areas (e.g., Coastal Carolina, Florida)? Without encroaching on property rights, can we dissuade people from building on flood-prone areas? Where can engineered solutions (e.g., dams, levees, retention walls) play a greater, economical and safe role in flood control?
Rip Currents - Rip currents have the sixth-highest death total (542) and the highest fatality rate (51.87%). As both the mechanics and high-risk areas of rip currents are well-known, it may be worthwhile to investigate signage at specific high-risk locations.
Heat-Related - The elderly are most susceptible to heat-related casualty. What warning systems or programs, if any, are in place where heat-related casualties tend to occur? Can we assess the efficacy of programs already in place? It may be worth exploring the use of PSAs aired in places facing an imminent heat wave.
Snow/Ice - An inch of snow will likely do more harm to Dallas or Atlanta than it would to Chicago or Minneapolis. What is the historical frequency and magnitude of snow/ice events in southern and mid-southern metro areas? What are the costs of acquiring snow/ice preparedness infrastructure (e.g., plows, salt, rental space, et cetera)? How can we use this information to help municipalities in need?
Wave/Tidal - Both earthquake-generated and landslide-generated tsunamis are nearly impossible to accurately forecast. Tidal events, however, often occur with regularity, and in some cases, at near-certain intervals. How many deaths and/or injuries occurred at places with known tidal phenomena? Are warning signs clearly placed at these locations? Did these deaths/injuries occur at night, when warning signs might not have been seen? Which of these areas (if any) lack signage?