Reproducible Research Peer Assignment 2: Of US Exeme Weather Events (1950-2011) Huricanes and Storms drive Costs while Tornados drive Deaths and Injuries

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities, causing fatalities, injuries, and property damage. Here we use the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which characterizes of major storms and weather events in the Unitied States, to explore which types of events are (1) most harmful with respect to population health and (2) have the greatest economic consequences. From our analysis of the top-1000 most costly events, flood events were most frequent, while hurricane and storm events where the most costly. Tornados on the other hand caused by far the most injuries and deaths over all other event categories. The next leading event categories were wind-related for injuries and heat-related for fatalities.

Data Processing

First things first, we'll download the NOAA Storm Database and its supporting documentation.

## Download the NOAA Storm database, documentation, and FAQ:
setwd("~/Desktop/data_science/05_reproduce/assign/RepData_PeerAssessment2/")
download_data_date = date()  # date of download
if (!file.exists("repdata-data-StormData.csv.bz2")) {

    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
        "repdata-data-StormData.csv.bz2", method = "curl")
}

if (!file.exists("repdata-peer2_doc-pd01016005curr.pdf")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf", 
        "repdata-peer2_doc-pd01016005curr.pdf", method = "curl")
}

if (!file.exists("repdata-peer2_doc-NCDC_Storm_Events-FAQ-Page.pdf")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf", 
        "repdata-peer2_doc-NCDC_Storm_Events-FAQ-Page.pdf", method = "curl")
}

The above files were downloaded Fri Jun 20 18:29:29 2014. Next lets load the data file repdata-data-StormData.csv using the bzfile and the read.csv functions.

# unzip and read in repdata-data-StormData.csv
if (file.exists("storm_raw.RData")) {
    # if its available load the cached datafile
    load("storm_raw.RData")
} else {
    storm_raw <- read.csv("repdata-data-StormData.csv.bz2")
    # that took a while to load, lets save the R file to disk
    save(storm_raw, file = "storm_raw.RData")
}

Using the head() function we see that there are a number of different paramters that could interest us as far as population health and economic cost:

To be prudent we will look at the distribution of population health variables (FATALITIES and INJURIES) and economic consequence variables (PROPDMG and CROPDMG) to see if these distributions are skewed to large values, so the overall question can be answered by looking a fewer data points, and simplier EVTYPE categories.

For simplicity we will assume that the dollar amounts in the database are corrected for inflation.

Economic Consequences: Calculate Total Cost Variable

Here we cleaned the PROPDMG and CROPDMG variables to create values representing their total costs as new variables TOTALPROPDMG, TOTALCROPDMG, and TOTALDMG for the total property damage costs, total crop damages costs, and the grand total costs, respectively. We also noted that one event has a total cost that was significantly larger than all others. Upon inspection of the event it was determined data error and the value was corrected to read millions of dollars instead of billions. Finally, we noticed that the top-1000 most costly events, represent approximately 80% of the total cumulative costs. This sample will likely good enough for us to generalize some general trends, and so will be the final dataset that we do our analyses on.

# first create a new copy of the storm data that we are modifying
storm <- storm_raw

## changing exponent variables into a numerical value property damange
storm$PROPDMGEXP <- chartr("-?+hHkKmMbB", "11122336699", storm$PROPDMGEXP)
storm$PROPDMGEXP[storm$PROPDMGEXP == ""] <- 0
storm$PROPDMGEXP <- as.numeric(storm$PROPDMGEXP)

# crop damage
storm$CROPDMGEXP <- chartr("-?+hHkKmMbB", "11122336699", storm$CROPDMGEXP)
storm$CROPDMGEXP[storm$CROPDMGEXP == ""] <- 0
storm$CROPDMGEXP <- as.numeric(storm$CROPDMGEXP)

## create new TOTALPROPDMG, TOTALCROPDMG, and TOTALDMG variables
storm$TOTALPROPDMG <- storm$PROPDMG * (10^storm$PROPDMGEXP)
storm$TOTALCROPDMG <- storm$CROPDMG * (10^storm$CROPDMGEXP)
storm$TOTALDMG <- storm$TOTALPROPDMG + storm$TOTALCROPDMG

# look at the top 10 we see one is much higher than the rest
table(sort(storm$TOTALDMG, decreasing = T)[1:10])
## 
##     5.15e+09    5.705e+09     7.35e+09     7.39e+09        1e+10 
##            1            1            1            1            2 
##    1.126e+10    1.693e+10     3.13e+10 115032500000 
##            1            1            1            1
# from description see that the values should have been in millions not
# billions
storm[storm$TOTALDMG == max(storm$TOTALDMG), "PROPDMGEXP"] = 6

# recalculate
storm$TOTALPROPDMG <- storm$PROPDMG * (10^storm$PROPDMGEXP)
storm$TOTALCROPDMG <- storm$CROPDMG * (10^storm$CROPDMGEXP)
storm$TOTALDMG <- storm$TOTALPROPDMG + storm$TOTALCROPDMG

# top 1000 events account for 80% of total damages
prop_cost_top_1000 = sum(sort(storm$TOTALDMG, decreasing = T)[1:1000])/sum(storm$TOTALDMG)

# subset original dataset for just these
storm_cost1000 <- storm[storm$TOTALDMG >= min(sort(storm$TOTALDMG, decreasing = T)[1:1000]), 
    ]

Now that we have isolated the events we are interested in we can now tackle the troublesome EVTYPE variable for just the most costly events.

# pattern and replacements based on generally equivalent types
keywords <- c("FLOOD", "TORNADO", "HURRICANE", "WIND", "FIRE", "(ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER)", 
    "COLD", "THUNDER", "(HEAT|HOT)", "(STORM|WET|RAIN)")  # regular expressions
replace <- c("FLOOD", "TORNADO", "HURRICANE", "WIND", "FIRE", "ICE", "COLD", 
    "THUNDER", "HEAT", "STORM")  # replacements

# change EVTYPE to character type to avoid NAs
storm_cost1000$EVTYPE <- as.character(storm_cost1000$EVTYPE)

# for each regular expression in EVTYPE, if found replace with replacement
# value
for (index in 1:length(keywords)) {
    # find rows that match the partcular EVTYPE pattern
    rows <- grep(keywords[index], storm_cost1000$EVTYPE, ignore.case = T)
    if (length(rows) > 0) {
        # if not empty replace with the replacement name
        storm_cost1000[rows, "EVTYPE"] = replace[index]
    }
}

# refactor EVTYPE based on our replacements above
storm_cost1000$EVTYPE <- factor(storm_cost1000$EVTYPE, levels = unique(storm_cost1000$EVTYPE))

# print a table of our replacements
library(xtable)
tab <- xtable(cbind(keywords, replace))
print(tab, type = "html")
keywords replace
1 FLOOD FLOOD
2 TORNADO TORNADO
3 HURRICANE HURRICANE
4 WIND WIND
5 FIRE FIRE
6 (ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER) ICE
7 COLD COLD
8 THUNDER THUNDER
9 (HEAT|HOT) HEAT
10 (STORM|WET|RAIN) STORM

Table 1: Table of patterns that were amalgamated into categories for the top-1000 total costs subset.

Population Health Related Variables

The population health variables, FATALITIES and INJURIES, seem to be well formed so we'll continue with them as-is. We will note many events did not report any injuries or fatalities, so we subset the original data frame by those events where we observed at least one injury or fatality going forward in our analysis.

# select the subset of events that had at least one injury or one fatalitiy
storm_hurt <- storm[(storm$INJURIES > 0 | storm$FATALITIES > 0), c("INJURIES", 
    "FATALITIES", "EVTYPE")]

# pattern and replacements based on generally equivalent types
keywords <- c("FLOOD", "(TORNADO|FUNNEL CLOUD)", "HURRICANE", "(WIND|GLAZE|DUST|DRY MICROBURST|STREAM)", 
    "FIRE", "(ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER|SLEET|WINTRY|ICY)", 
    "(COLD|HYPOTHERMIA|LOW TEMP|FOG)", "THUNDER", "(HEAT|HOT|WARM|HIGH|HYPERTHERMIA|DROUGHT)", 
    "(STORM|WET|RAIN|TYPHOON|PRECIP)", "(SURF|DROWNING|SEA|WATER|CURRENT|MARINE|SWELLS|WAVE)", 
    "LIGHTNING", "MUD", "AVAL", "LANDSLIDE")  # regular expressions
replace <- c("FLOOD", "TORNADO", "HURRICANE", "WIND", "FIRE", "ICE", "COLD", 
    "THUNDER", "HEAT", "STORM", "DROWNING", "LIGHTNING", "MUD", "AVALANCHE", 
    "LANDSLIDE")  # replacements

# change EVTYPE to character type to avoid NAs
storm_hurt$EVTYPE <- as.character(storm_hurt$EVTYPE)

# for each regular expression in EVTYPE, if found replace with replacement
# value
for (index in 1:length(keywords)) {
    # find rows that match the partcular EVTYPE pattern
    rows <- grep(keywords[index], storm_hurt$EVTYPE, ignore.case = T)
    if (length(rows) > 0) {
        # if not empty replace with the replacement name
        storm_hurt[rows, "EVTYPE"] = replace[index]
    }
}

# all leftovers gets set to other
storm_hurt$EVTYPE[is.na(storm_hurt$EVTYPE)] = "OTHER"

# refactor EVTYPE based on our replacements above
storm_hurt$EVTYPE <- factor(storm_hurt$EVTYPE, levels = unique(storm_hurt$EVTYPE))

# plot a table of the conversion
library(xtable)
tab <- xtable(cbind(keywords, replace))
print(tab, type = "html")
keywords replace
1 FLOOD FLOOD
2 (TORNADO|FUNNEL CLOUD) TORNADO
3 HURRICANE HURRICANE
4 (WIND|GLAZE|DUST|DRY MICROBURST|STREAM) WIND
5 FIRE FIRE
6 (ICE|FROST|SNOW|HAIL|FREEZ|BLIZZARD|WINTER|SLEET|WINTRY|ICY) ICE
7 (COLD|HYPOTHERMIA|LOW TEMP|FOG) COLD
8 THUNDER THUNDER
9 (HEAT|HOT|WARM|HIGH|HYPERTHERMIA|DROUGHT) HEAT
10 (STORM|WET|RAIN|TYPHOON|PRECIP) STORM
11 (SURF|DROWNING|SEA|WATER|CURRENT|MARINE|SWELLS|WAVE) DROWNING
12 LIGHTNING LIGHTNING
13 MUD MUD
14 AVAL AVALANCHE
15 LANDSLIDE LANDSLIDE

Table 2: Table of patterns that for the EVTYPE variable that were amalgamated into event categories for population health related variables.

Results

Now we will analyize the processed data for Economic and Population Health costs.

Economic Costs: Property and Crop Damage

Here we sum the total costs of propery, copy, and total damages for each of the event categories. We find that the frequent events (Figure 1A) are not nessisarily the most costly (Figure 1B). Further, we can see that drought events have a greater crop cost than property damage costs.

# calcualte based on the top-1000 the respective costs in each EVTYPE
Property <- tapply(storm_cost1000$TOTALPROPDMG, storm_cost1000$EVTYPE, sum)
Crops <- tapply(storm_cost1000$TOTALCROPDMG, storm_cost1000$EVTYPE, sum)
Total <- tapply(storm_cost1000$TOTALDMG, storm_cost1000$EVTYPE, sum)
order <- names(sort(Total))  # want to sort plots by total value

# create dataframe of results
damage_df <- data.frame(Property = Property[order], Crops = Crops[order], row.names = order)

# plot type frequency and costs as barplots
par(mar = c(5.1, 8.1, 4.1, 2.1))
par(mfrow = c(1, 2))
barplot(sort(table(storm_cost1000$EVTYPE)), horiz = T, las = 1, xlab = "Frequency", 
    main = "(A) Most Occurences")
barplot(t(as.matrix(damage_df/10^9)), horiz = T, las = 1, xlab = "Cost in US Dollars (Trillions)", 
    main = "(B) Cummulative Costs", legend = names(damage_df), col = c("salmon", 
        "skyblue1"), args.legend = list(x = 80, y = 3))

plot of chunk unnamed-chunk-6

Figure 1: Top-1000 most costly weather events by event category. (A) Bar plot of the frequey of events. (B) Cumulative costs in USD for each event type, divided into property and crop-related costs.

Population Health: Injuries and Fatalities

Now we will assess injury and fatailities accross event category. A simple count shows that Tornado events have by far the largest overall contribution to cummulative injuries and fatalities (Figure 2). Injuries in particular are more than four times the next leading category. However, fatailities have a less strong lead. Heat-relatied failities look to be on a simmilar scale as Tornados.

library(ggplot2)
library(reshape2)
storm_hurt.m <- melt(storm_hurt, id.vars = c("EVTYPE"))
storm_hurt.m$EVTYPE <- factor(storm_hurt.m$EVTYPE, levels = names(sort(tapply(storm_hurt.m$value, 
    storm_hurt.m$EVTYPE, sum))))
ggplot(data = storm_hurt.m, aes(EVTYPE, fill = variable)) + geom_bar(aes(weight = value)) + 
    coord_flip() + theme_bw() + labs(y = "Frequency", x = "Event Category")

plot of chunk unnamed-chunk-7

Figure 2: Injuries and Fatalities by weather event category.