Title: Most Harmful Weather Events for Population Health and Economy in US

Martin Hediger

Synopsis

US Storm Data is analysed to determine the most harmful events for 1) population health and 2) economic damage. It is found that tornado events result in most fatalities as well as injuries, even when all high-temperature related events are considered as one category. It is found that flood events cause the largest amount of economical damage. Both findings appear reasonable because i) tornados are very fast and can result in debries becoming as dangerous as bullets and ii) floods can inflict large amounts of damage because they affect large areas (but are maybe better evaded and so lead to less fatalities).

Data Processing

Because the two research questions focus on different aspects of the dataset and to be more efficient, the dataset is subsetted such that it contains only these variables which are relevant for the specific research question. The respective variables are written to disc (in rds format) and then reloaded, no external data processing is done.

To note, the data processing is inefficient at first run but is efficient on subsequent runs.

Processing for Research Question 1
For research question 1, the following variables are considered relevant and extracted to a separate, reduced dataset dred1 (subsequently then loaded again if available as d1):

STATE
EVTYPE
FATALITIES
INJURIES

# Initial load and dataset reduction.
if(!file.exists("./repdata_data_StormData.csv.bz2"))
{
    temp <- "./repdata_data_StormData.csv.bz2"
    download.file(
        "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
        temp, mode="wb"
    )
}
if(!file.exists("./dred1.rds"))
{
    d <- read.csv("repdata_data_StormData.csv.bz2", header=T)
    dred1 <- d[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES")]
    saveRDS(dred1, file="./dred1.rds")
    rm(ls="d")
}
d1 <- readRDS("./dred1.rds")

This transformation is justified because it drastically increases dataset handling efficiency and reduces memory requirements of the computer.
As a second transformation, all EVTYPE factor labels are converted to upper case, this gives a reduction of length(levels(d1$EVTYPE)) - length(levels(factor(toupper(d1$EVTYPE)))) = 87 factors which only differ in their factor level labels cases.

d1$EVTYPE <- factor(toupper(d1$EVTYPE))
str(d1)

## 'data.frame':    902297 obs. of  4 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 898 levels "   HIGH SURF ADVISORY",..: 758 758 758 758 758 758 758 758 758 758 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...

Processing for Research Question 2
For research question 2, the following variables are considered relevant and extracted to a separate, reduced dataset dred2 (subsequently then loaded again if available as d2):

STATE
PROPDMG
PROPDMGEXP

# Initial load and dataset reduction.
if(!file.exists("./repdata_data_StormData.csv.bz2"))
{
    temp <- "./repdata_data_StormData.csv.bz2"
    download.file(
        "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
        temp, mode="wb"
    )
}
if(!file.exists("./dred2.rds"))
{
    d <- read.csv("repdata_data_StormData.csv.bz2", header=T)
    dred2 <- d[, c("STATE", "EVTYPE", "PROPDMG", "PROPDMGEXP")]
    saveRDS(dred2, file="./dred2.rds")
    rm(ls="d")
}
d2 <- readRDS("dred2.rds")

The inspection of the d2 dataset reveils that the PROPDMGEXP variable characterizes the magnitude of the PROPDMG variable, but in a non-tidy way.
Therefore the following rules are applied:

If PROPDMGEXP is '-/+/0/?' it is defined as 0
PROPDMGEXP 'h/H' are defined as 2, 'K' is defined as 3, 'm/M' are defined as 6, 'B' is defined as 9

# Normalize ambiguous factor levels.
# Normalization of factors appears easier when the factors are first converted
# to characters followed by converstion to a numerical type for subsequent
# use as exponents.
d2$PROPDMGEXP <- as.character(d2$PROPDMGEXP)
d2[which(d2$PROPDMGEXP==""), ]$PROPDMGEXP  <- "0"
d2[which(d2$PROPDMGEXP=="-"), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="?"), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="+"), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="h"), ]$PROPDMGEXP <- "2"
d2[which(d2$PROPDMGEXP=="H"), ]$PROPDMGEXP <- "2"
d2[which(d2$PROPDMGEXP=="K"), ]$PROPDMGEXP <- "3"
d2[which(d2$PROPDMGEXP=="m"), ]$PROPDMGEXP <- "6"
d2[which(d2$PROPDMGEXP=="M"), ]$PROPDMGEXP <- "6"
d2[which(d2$PROPDMGEXP=="B"), ]$PROPDMGEXP <- "9"
d2$PROPDMGEXP <- as.numeric(d2$PROPDMGEXP)

Conversion to a data.table object allows to easily create a new variable representing the total damage:

library(data.table)
d2 <- as.data.table(d2)
d2[, dmg:=d2$PROPDMG*10^d2$PROPDMGEXP]

##         STATE     EVTYPE PROPDMG PROPDMGEXP   dmg
##      1:    AL    TORNADO    25.0          3 25000
##      2:    AL    TORNADO     2.5          3  2500
##      3:    AL    TORNADO    25.0          3 25000
##      4:    AL    TORNADO     2.5          3  2500
##      5:    AL    TORNADO     2.5          3  2500
##     ---                                          
## 902293:    WY  HIGH WIND     0.0          3     0
## 902294:    MT  HIGH WIND     0.0          3     0
## 902295:    AK  HIGH WIND     0.0          3     0
## 902296:    AK   BLIZZARD     0.0          3     0
## 902297:    AL HEAVY SNOW     0.0          3     0

This dataset is then used in the analysis (of question 2) in the next section.

Results

Question 1: What are the most harmful event types with respect to population health?
Population health indicators are considered to be

Fatalities: d1$FATALITIES
Injuries: d1$INJURIES

It appears as if temperature and tornados have highest fatality rates:

head(d1[order(d1$FAT, decreasing=T), ], 20)

##        STATE         EVTYPE FATALITIES INJURIES
## 198704    IL           HEAT        583        0
## 862634    MO        TORNADO        158     1150
## 68670     MI        TORNADO        116      785
## 148852    TX        TORNADO        114      597
## 355128    IL EXCESSIVE HEAT         99        0
## 67884     MA        TORNADO         90     1228
## 46309     KS        TORNADO         75      270
## 371112    PA EXCESSIVE HEAT         74      135
## 230927    PA EXCESSIVE HEAT         67        0
## 78567     MS        TORNADO         57      504
## 247938    WI   EXTREME HEAT         57        0
## 6370      AR        TORNADO         50      325
## 598500    TX EXCESSIVE HEAT         49        0
## 606363    CA EXCESSIVE HEAT         46       18
## 860386    AL        TORNADO         44      800
## 157885    TX        TORNADO         42     1700
## 362850    MO EXCESSIVE HEAT         42      397
## 629242    NY EXCESSIVE HEAT         42        0
## 78123     MS        TORNADO         38      270
## 83578     MO        TORNADO         37      176

In addition to temperature and tornados, it seems as if hurricanes, floods and rain are also responsible for a number of fatalities (these will be addressed below).
However, what if there are far more tornados than heat events so that the sum of tornado related fatalities would exceed heat related fatalities? This will be further inspected below.
In order to analyse this, all event types related to temperature and tornados are combined into a generalized HEAT/TORNADO category and then the number of related fatalities and are counted.
In order to normalize the categories, all levels relating either the expressions 'WARM', 'HEAT' or 'TORNADO' are collected:

# Unlike e.g. Bash, 'grep' in R returns indeces, not matched strings.
heat_related_levels <- grep("HEAT|WARM", x=levels(d1$EV))
heat_related_events <- d1$EV %in% levels(d1$EV)[heat_related_levels]
d1[heat_related_events, ]$EVTYPE <- factor("HEAT")

# 'TORNADO' related:
wind_related_levels <- grep("STORM|WIND", x=levels(d1$EV))
wind_related_events <- d1$EV %in% levels(d1$EV)[wind_related_levels]
d1[wind_related_events, ]$EVTYPE <- factor("WIND")

For simplicity, a new data frame is created and then analysed using a barplot:

# (Rotate x-axis labels and adjust margins.)
par(las=2); par(mar=c(8, 4, 2, 1))
k <- as.data.frame(tapply(d1$FATALITIES, INDEX=d1$EVTYPE, FUN=sum))
colnames(k) <- "FATALITIES"
barplot(k[order(k$FATALITIES, decreasing=T), ][1:10],
        main="10 Most Harmful Event Types [FATALITIES]", ylab="Fatalities")

plot of chunk unnamed-chunk-8

So indeed, as was guessed above, even if the most fatal event is a heat-type event, the total number of wind related fatalities is larger.

Injuries can be analysed analogously.

par(las=2); par(mar=c(8, 4, 2, 1))
k <- as.data.frame(tapply(d1$INJURIES, INDEX=d1$EVTYPE, FUN=sum))
colnames(k) <- "INJURIES"
barplot(k[order(k$INJURIES, decreasing=T), ][1:10],
        main="10 Most Harmful Event Types [INJURIES]", ylab="Injuries")

plot of chunk unnamed-chunk-9

To summarize, both with respect to fatalities and injuries, tornados are the most harmful events.

Question 2: Which events result in highest economic damage?

The events causing the largest economic damage are identified in the following figure.

par(las=2); par(mar=c(8, 6, 2, 1))
k <- as.data.frame(tapply(d2$dmg, INDEX=d1$EVTYPE, FUN=sum))
colnames(k) <- "DAMAGE"
barplot(k[order(k$DAMAGE, decreasing=T), ][1:10],
        main="Top 10 Most Damaging Event Types", ylab="Damage")

plot of chunk unnamed-chunk-10

The most damaging events are thus floods.

To do/Optional:

Handle .bz2 file extraction
Externalize normalization of event types to generic function
Combine events STORM/HURRICANE, FLOOD/WATER and possibly others