Storms and other severe weather events can cause both public health and economic problems for communities and municipalities, causing fatalities, injuries, and property damage. Here we use the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which characterizes of major storms and weather events in the Unitied States, to explore which types of events are (1) most harmful with respect to population health and (2) have the greatest economic consequences. From our analysis of the top-1000 most costly events, flood events were most frequent, while hurricane and storm events where the most costly. Tornados on the other hand caused by far the most injuries and deaths over all other event categories. The next leading event categories were wind-related for injuries and heat-related for fatalities.
First things first, we'll download the NOAA Storm Database and its supporting documentation.
## Download the NOAA Storm database, documentation, and FAQ:
setwd("~/Desktop/data_science/05_reproduce/assign/RepData_PeerAssessment2/")
download_data_date = date() # date of download
if (!file.exists("repdata-data-StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"repdata-data-StormData.csv.bz2", method = "curl")
}
if (!file.exists("repdata-peer2_doc-pd01016005curr.pdf")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf",
"repdata-peer2_doc-pd01016005curr.pdf", method = "curl")
}
if (!file.exists("repdata-peer2_doc-NCDC_Storm_Events-FAQ-Page.pdf")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf",
"repdata-peer2_doc-NCDC_Storm_Events-FAQ-Page.pdf", method = "curl")
}
The above files were downloaded Fri Jun 20 18:29:29 2014. Next lets load the data file repdata-data-StormData.csv
using the bzfile
and the read.csv
functions.
# unzip and read in repdata-data-StormData.csv
if (file.exists("storm_raw.RData")) {
# if its available load the cached datafile
load("storm_raw.RData")
} else {
storm_raw <- read.csv("repdata-data-StormData.csv.bz2")
# that took a while to load, lets save the R file to disk
save(storm_raw, file = "storm_raw.RData")
}
Using the head()
function we see that there are a number of different paramters that could interest us as far as population health and economic cost:
EVTYPE
: the hazardous event type that we are intrested in (40 documented yet more than 900 unique in dataset)FATALITIES
: number of fatalities (i.e., deaths) associated with the eventINJURIES
: number of injuries associated with the eventPROPDMG
: property damange amountPROPDMGEXP
: presumably the exponent of the value, we will assume after decoding this represents \( 10^x \)CROPDMG
: crop damange amountCROPDMGEXP
: crop damange expoented, we will assume after decoding this represents \( 10^x \)To be prudent we will look at the distribution of population health variables (FATALITIES
and INJURIES
) and economic consequence variables (PROPDMG
and CROPDMG
) to see if these distributions are skewed to large values, so the overall question can be answered by looking a fewer data points, and simplier EVTYPE
categories.
For simplicity we will assume that the dollar amounts in the database are corrected for inflation.
Here we cleaned the PROPDMG
and CROPDMG
variables to create values representing their total costs as new variables TOTALPROPDMG
, TOTALCROPDMG
, and TOTALDMG
for the total property damage costs, total crop damages costs, and the grand total costs, respectively. We also noted that one event has a total cost that was significantly larger than all others. Upon inspection of the event it was determined data error and the value was corrected to read millions of dollars instead of billions. Finally, we noticed that the top-1000 most costly events, represent approximately 80% of the total cumulative costs. This sample will likely good enough for us to generalize some general trends, and so will be the final dataset that we do our analyses on.
# first create a new copy of the storm data that we are modifying
storm <- storm_raw
## changing exponent variables into a numerical value property damange
storm$PROPDMGEXP <- chartr("-?+hHkKmMbB", "11122336699", storm$PROPDMGEXP)
storm$PROPDMGEXP[storm$PROPDMGEXP == ""] <- 0
storm$PROPDMGEXP <- as.numeric(storm$PROPDMGEXP)
# crop damage
storm$CROPDMGEXP <- chartr("-?+hHkKmMbB", "11122336699", storm$CROPDMGEXP)
storm$CROPDMGEXP[storm$CROPDMGEXP == ""] <- 0
storm$CROPDMGEXP <- as.numeric(storm$CROPDMGEXP)
## create new TOTALPROPDMG, TOTALCROPDMG, and TOTALDMG variables
storm$TOTALPROPDMG <- storm$PROPDMG * (10^storm$PROPDMGEXP)
storm$TOTALCROPDMG <- storm$CROPDMG * (10^storm$CROPDMGEXP)
storm$TOTALDMG <- storm$TOTALPROPDMG + storm$TOTALCROPDMG
# look at the top 10 we see one is much higher than the rest
table(sort(storm$TOTALDMG, decreasing = T)[1:10])
##
## 5.15e+09 5.705e+09 7.35e+09 7.39e+09 1e+10
## 1 1 1 1 2
## 1.126e+10 1.693e+10 3.13e+10 115032500000
## 1 1 1 1
# from description see that the values should have been in millions not
# billions
storm[storm$TOTALDMG == max(storm$TOTALDMG), "PROPDMGEXP"] = 6
# recalculate
storm$TOTALPROPDMG <- storm$PROPDMG * (10^storm$PROPDMGEXP)
storm$TOTALCROPDMG <- storm$CROPDMG * (10^storm$CROPDMGEXP)
storm$TOTALDMG <- storm$TOTALPROPDMG + storm$TOTALCROPDMG
# top 1000 events account for 80% of total damages
prop_cost_top_1000 = sum(sort(storm$TOTALDMG, decreasing = T)[1:1000])/sum(storm$TOTALDMG)
# subset original dataset for just these
storm_cost1000 <- storm[storm$TOTALDMG >= min(sort(storm$TOTALDMG, decreasing = T)[1:1000]),
]
Now that we have isolated the events we are interested in we can now tackle the troublesome EVTYPE variable for just the most costly events.
# pattern and replacements based on generally equivalent types
keywords <- c("FLOOD", "TORNADO", "HURRICANE", "WIND", "FIRE", "(ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER)",
"COLD", "THUNDER", "(HEAT|HOT)", "(STORM|WET|RAIN)") # regular expressions
replace <- c("FLOOD", "TORNADO", "HURRICANE", "WIND", "FIRE", "ICE", "COLD",
"THUNDER", "HEAT", "STORM") # replacements
# change EVTYPE to character type to avoid NAs
storm_cost1000$EVTYPE <- as.character(storm_cost1000$EVTYPE)
# for each regular expression in EVTYPE, if found replace with replacement
# value
for (index in 1:length(keywords)) {
# find rows that match the partcular EVTYPE pattern
rows <- grep(keywords[index], storm_cost1000$EVTYPE, ignore.case = T)
if (length(rows) > 0) {
# if not empty replace with the replacement name
storm_cost1000[rows, "EVTYPE"] = replace[index]
}
}
# refactor EVTYPE based on our replacements above
storm_cost1000$EVTYPE <- factor(storm_cost1000$EVTYPE, levels = unique(storm_cost1000$EVTYPE))
# print a table of our replacements
library(xtable)
tab <- xtable(cbind(keywords, replace))
print(tab, type = "html")
keywords | replace | |
---|---|---|
1 | FLOOD | FLOOD |
2 | TORNADO | TORNADO |
3 | HURRICANE | HURRICANE |
4 | WIND | WIND |
5 | FIRE | FIRE |
6 | (ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER) | ICE |
7 | COLD | COLD |
8 | THUNDER | THUNDER |
9 | (HEAT|HOT) | HEAT |
10 | (STORM|WET|RAIN) | STORM |
Table 1: Table of patterns that were amalgamated into categories for the top-1000 total costs subset.
The population health variables, FATALITIES
and INJURIES
, seem to be well formed so we'll continue with them as-is. We will note many events did not report any injuries or fatalities, so we subset the original data frame by those events where we observed at least one injury or fatality going forward in our analysis.
# select the subset of events that had at least one injury or one fatalitiy
storm_hurt <- storm[(storm$INJURIES > 0 | storm$FATALITIES > 0), c("INJURIES",
"FATALITIES", "EVTYPE")]
# pattern and replacements based on generally equivalent types
keywords <- c("FLOOD", "(TORNADO|FUNNEL CLOUD)", "HURRICANE", "(WIND|GLAZE|DUST|DRY MICROBURST|STREAM)",
"FIRE", "(ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER|SLEET|WINTRY|ICY)",
"(COLD|HYPOTHERMIA|LOW TEMP|FOG)", "THUNDER", "(HEAT|HOT|WARM|HIGH|HYPERTHERMIA|DROUGHT)",
"(STORM|WET|RAIN|TYPHOON|PRECIP)", "(SURF|DROWNING|SEA|WATER|CURRENT|MARINE|SWELLS|WAVE)",
"LIGHTNING", "MUD", "AVAL", "LANDSLIDE") # regular expressions
replace <- c("FLOOD", "TORNADO", "HURRICANE", "WIND", "FIRE", "ICE", "COLD",
"THUNDER", "HEAT", "STORM", "DROWNING", "LIGHTNING", "MUD", "AVALANCHE",
"LANDSLIDE") # replacements
# change EVTYPE to character type to avoid NAs
storm_hurt$EVTYPE <- as.character(storm_hurt$EVTYPE)
# for each regular expression in EVTYPE, if found replace with replacement
# value
for (index in 1:length(keywords)) {
# find rows that match the partcular EVTYPE pattern
rows <- grep(keywords[index], storm_hurt$EVTYPE, ignore.case = T)
if (length(rows) > 0) {
# if not empty replace with the replacement name
storm_hurt[rows, "EVTYPE"] = replace[index]
}
}
# all leftovers gets set to other
storm_hurt$EVTYPE[is.na(storm_hurt$EVTYPE)] = "OTHER"
# refactor EVTYPE based on our replacements above
storm_hurt$EVTYPE <- factor(storm_hurt$EVTYPE, levels = unique(storm_hurt$EVTYPE))
# plot a table of the conversion
library(xtable)
tab <- xtable(cbind(keywords, replace))
print(tab, type = "html")
keywords | replace | |
---|---|---|
1 | FLOOD | FLOOD |
2 | (TORNADO|FUNNEL CLOUD) | TORNADO |
3 | HURRICANE | HURRICANE |
4 | (WIND|GLAZE|DUST|DRY MICROBURST|STREAM) | WIND |
5 | FIRE | FIRE |
6 | (ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER|SLEET|WINTRY|ICY) | ICE |
7 | (COLD|HYPOTHERMIA|LOW TEMP|FOG) | COLD |
8 | THUNDER | THUNDER |
9 | (HEAT|HOT|WARM|HIGH|HYPERTHERMIA|DROUGHT) | HEAT |
10 | (STORM|WET|RAIN|TYPHOON|PRECIP) | STORM |
11 | (SURF|DROWNING|SEA|WATER|CURRENT|MARINE|SWELLS|WAVE) | DROWNING |
12 | LIGHTNING | LIGHTNING |
13 | MUD | MUD |
14 | AVAL | AVALANCHE |
15 | LANDSLIDE | LANDSLIDE |
Table 2: Table of patterns that for the EVTYPE
variable that were amalgamated into event categories for population health related variables.
Now we will analyize the processed data for Economic and Population Health costs.
Here we sum the total costs of propery, copy, and total damages for each of the event categories. We find that the frequent events (Figure 1A) are not nessisarily the most costly (Figure 1B). Further, we can see that drought events have a greater crop cost than property damage costs.
# calcualte based on the top-1000 the respective costs in each EVTYPE
Property <- tapply(storm_cost1000$TOTALPROPDMG, storm_cost1000$EVTYPE, sum)
Crops <- tapply(storm_cost1000$TOTALCROPDMG, storm_cost1000$EVTYPE, sum)
Total <- tapply(storm_cost1000$TOTALDMG, storm_cost1000$EVTYPE, sum)
order <- names(sort(Total)) # want to sort plots by total value
# create dataframe of results
damage_df <- data.frame(Property = Property[order], Crops = Crops[order], row.names = order)
# plot type frequency and costs as barplots
par(mar = c(5.1, 8.1, 4.1, 2.1))
par(mfrow = c(1, 2))
barplot(sort(table(storm_cost1000$EVTYPE)), horiz = T, las = 1, xlab = "Frequency",
main = "(A) Most Occurences")
barplot(t(as.matrix(damage_df/10^9)), horiz = T, las = 1, xlab = "Cost in US Dollars (Trillions)",
main = "(B) Cummulative Costs", legend = names(damage_df), col = c("salmon",
"skyblue1"), args.legend = list(x = 80, y = 3))
Figure 1: Top-1000 most costly weather events by event category. (A) Bar plot of the frequey of events. (B) Cumulative costs in USD for each event type, divided into property and crop-related costs.
Now we will assess injury and fatailities accross event category. A simple count shows that Tornado events have by far the largest overall contribution to cummulative injuries and fatalities (Figure 2). Injuries in particular are more than four times the next leading category. However, fatailities have a less strong lead. Heat-relatied failities look to be on a simmilar scale as Tornados.
library(ggplot2)
library(reshape2)
storm_hurt.m <- melt(storm_hurt, id.vars = c("EVTYPE"))
storm_hurt.m$EVTYPE <- factor(storm_hurt.m$EVTYPE, levels = names(sort(tapply(storm_hurt.m$value,
storm_hurt.m$EVTYPE, sum))))
ggplot(data = storm_hurt.m, aes(EVTYPE, fill = variable)) + geom_bar(aes(weight = value)) +
coord_flip() + theme_bw() + labs(y = "Frequency", x = "Event Category")
Figure 2: Injuries and Fatalities by weather event category.