Reproducible Research: Peer Assessment 2

Synopsis

In this paper, we investigate the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. We explain the variables and data where needed; for more information, please refer to the Storm Data Documentation.

We will load the raw data, process it, and attempt to answer the following questions:

Across the United States, which types of events (as indicated in the ‘EVTYPE’ variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

Most Harmful Events to Population Health

A compressed file containing the data can be found at this link. The code to download and unzip is in the raw .Rmd file. Here, we will assume the file is in the working directory and load it.

raw <- read.csv("repdata-data-StormData.csv")

To answer the first question, which events are “most harmful to health”, we’ll process the data to look at the injuries and deaths resulting from these disasters. Ideally we would clean the data by clustering similar events in the EVTYPE variable, but this presents issues. We could cluster by keywords like wind or rain, but because of how the events are titled, we would end up counting certain observations multiple times. For instance, the event titled RAIN/WIND would contribute fatalities and injuries to both categories. While this may be useful in some settings, here we will simply use the event names as they appear in the original data.
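
The double-counting problem can be illustrated with a small sketch. The data frame `toy` and its values are hypothetical and are not taken from the NOAA file; only the label RAIN/WIND comes from the discussion above.

```r
# Toy data (hypothetical values, not taken from the NOAA file)
toy <- data.frame(
  EVTYPE     = c("RAIN", "WIND", "RAIN/WIND"),
  FATALITIES = c(1, 2, 4),
  stringsAsFactors = FALSE
)

# Keyword-based grouping: an event belongs to every keyword it matches
keywords <- c("RAIN", "WIND")
by_keyword <- sapply(keywords, function(k) {
  sum(toy$FATALITIES[grepl(k, toy$EVTYPE, fixed = TRUE)])
})

by_keyword            # per-keyword totals
sum(by_keyword)       # larger than the true total: RAIN/WIND is counted twice
sum(toy$FATALITIES)   # true total
```

In this toy case the keyword totals add up to more than the actual number of fatalities, which is the distortion we avoid by keeping the original event names.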

It is beyond the scope of this paper to determine the appropriate relative weighting of deaths and injuries, so let us measure harm as the sum of injuries and fatalities. We will create a new dataframe, health, which contains the variables EVTYPE, FATALITIES, and INJURIES, plus harm, the sum of FATALITIES and INJURIES.

# Select columns by name rather than by position (columns 8, 23, 24)
health <- raw[, c("EVTYPE", "FATALITIES", "INJURIES")]
health$harm <- health$FATALITIES + health$INJURIES

Let us aggregate the data to get the total and the mean harm for each event type.

sums <- aggregate(harm~EVTYPE, data = health, sum)
avgs <- aggregate(harm~EVTYPE, data = health, mean)

These new dataframes, sums and avgs, will help determine how harmful these disasters have proven to be. Fortunately, every observation is complete, so we do not need to impute missing data. Now, let’s sort by our harm variable in descending order and keep the top 10 offenders for both total harm and average harm.
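
A completeness claim like this can be checked with a sketch along the following lines. The data frame `health_toy` is a hypothetical stand-in for `health`, and it deliberately contains one missing value so that the check has something to find. The check matters because the formula interface of `aggregate` drops rows with missing values by default.

```r
# Toy stand-in for the `health` data frame (hypothetical values)
health_toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 0, 1),
  INJURIES   = c(10, 2, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(health_toy))

# TRUE only if every row has no missing values
all_complete <- all(complete.cases(health_toy))

na_counts
all_complete
```

Running the same two checks on `health` should report zero missing values in every column if the observations are complete as stated.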

top_sums <- sums[with(sums, order(-harm)), ][1:10, ]
top_avgs <- avgs[with(avgs, order(-harm)), ][1:10, ]

We will plot top_sums and top_avgs in the results section.

library(ggplot2)
h_sum <- ggplot(data=top_sums, aes(x=EVTYPE, y=harm,fill=EVTYPE)) +
    geom_bar(stat="identity") + 
    theme(axis.text.x = element_blank()) +
    ggtitle("Top 10 most harmful events overall")+ scale_fill_hue(l=40)
h_avg <- ggplot(data=top_avgs, aes(x=EVTYPE, y=harm,fill=EVTYPE)) +
    geom_bar(stat="identity") + 
    theme(axis.text.x = element_blank()) +
    ggtitle("Top 10 most harmful events on average")

Events with Greatest Economic Consequences

Calculating economic consequences is more cumbersome, because the dataset splits both property damage and crop damage across two variables each. For instance, to calculate the damage to crops for a given event, we need to multiply the value under CROPDMG by the multiplier associated with the value in CROPDMGEXP. This gets even trickier since CROPDMGEXP should only contain K, M, and B for thousands, millions, and billions respectively, though some entries have other unidentified values such as “0” or “?”. Thus, we will use 0 as the multiplier in all cases where we do not have a K, M, or B.

econ <- raw[,c(8,25:28)]
econ$PROPDMGEXP <- tolower(econ$PROPDMGEXP)
econ$CROPDMGEXP <- tolower(econ$CROPDMGEXP)

# Recognized exponent codes and the multipliers they stand for
m <- c("k","m","b")
n <- c(1000,1000000,1000000000)

# Any code other than k, m, or b gets a multiplier of 0
index_p<- econ$PROPDMGEXP != m[1] & econ$PROPDMGEXP != m[2] & econ$PROPDMGEXP != m[3]
econ$PROPDMGEXP[index_p] <- 0

index_c<- econ$CROPDMGEXP != m[1] & econ$CROPDMGEXP != m[2] & econ$CROPDMGEXP != m[3]
econ$CROPDMGEXP[index_c] <- 0

# Replace each recognized code with its multiplier (stored as text for now)
for (i in seq_along(m)){
    index_p <- econ$PROPDMGEXP == m[i]
    index_c <- econ$CROPDMGEXP == m[i]
    econ$PROPDMGEXP[index_p] <- n[i]
    econ$CROPDMGEXP[index_c] <- n[i]
}

econ$PROPDMGEXP <- as.numeric(econ$PROPDMGEXP)
econ$CROPDMGEXP <- as.numeric(econ$CROPDMGEXP)

econ$total_damage <- with(econ,PROPDMG*PROPDMGEXP + CROPDMG*CROPDMGEXP)
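
As an aside, the same mapping from exponent codes to multipliers can be written more compactly with a named lookup vector. This is a sketch, not the code used for the results in this report. The names `econ_toy`, `multipliers`, `mult`, and `prop_damage` are hypothetical, and the values are made up for illustration.

```r
# Toy stand-in for the damage columns (hypothetical values)
econ_toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "m", "B", "?"),
  stringsAsFactors = FALSE
)

# Named lookup: exponent code -> multiplier
multipliers <- c(k = 1e3, m = 1e6, b = 1e9)

# Indexing by name returns NA for unrecognized codes; set those to 0
mult <- unname(multipliers[tolower(econ_toy$PROPDMGEXP)])
mult[is.na(mult)] <- 0

econ_toy$prop_damage <- econ_toy$PROPDMG * mult
econ_toy
```

The lookup keeps the multiplier numeric throughout, so no conversion with `as.numeric` is needed afterwards.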

Now let us aggregate and order the results by overall damage and average damage.

damage_sums <- aggregate(total_damage~EVTYPE, data = econ, sum)
damage_avgs <- aggregate(total_damage~EVTYPE, data = econ, mean)
top_damage_sums<- damage_sums[with(damage_sums,order(-total_damage)),][1:10,]
top_damage_avgs<- damage_avgs[with(damage_avgs,order(-total_damage)),][1:10,]

We’ll store some plots now to evaluate in the results section.

dmg_sum <- ggplot(data=top_damage_sums, aes(x=EVTYPE, y=total_damage,fill=EVTYPE)) +
    geom_bar(stat="identity") + 
    theme(axis.text.x = element_blank()) +
    ggtitle("Top 10 most costly events overall") + scale_fill_hue(l=40)

dmg_avg <- ggplot(data=top_damage_avgs, aes(x=EVTYPE, y=total_damage,fill=EVTYPE)) +
    geom_bar(stat="identity") + 
    theme(axis.text.x = element_blank()) +
    ggtitle("Top 10 most costly events on average")

Results

Now, let us visualize the processed data.

Most Harmful Events to Population Health

head(top_sums)

##             EVTYPE  harm
## 834        TORNADO 96979
## 130 EXCESSIVE HEAT  8428
## 856      TSTM WIND  7461
## 170          FLOOD  7259
## 464      LIGHTNING  6046
## 275           HEAT  3037

From the data, we see that tornadoes have caused far more fatalities and injuries overall than any other event type.

h_avg

However, when we look at the average harm per event, tornadoes appear less severe. This is because they occur much more frequently than events such as heat waves, which are more harmful on average when they do occur.
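
The way event frequency affects the average can be seen in a small sketch. The data frame `toy` and its values are hypothetical; they only show how a frequent event type can have a large total but a small mean.

```r
# Toy data (hypothetical values): one frequent event type, one rare one
toy <- data.frame(
  EVTYPE = c(rep("TORNADO", 4), "HEAT WAVE"),
  harm   = c(0, 0, 2, 10, 8)
)

# Count, total, and mean harm per event type
n     <- tapply(toy$harm, toy$EVTYPE, length)
total <- tapply(toy$harm, toy$EVTYPE, sum)

per_type <- data.frame(
  n     = as.vector(n),
  total = as.vector(total),
  avg   = as.vector(total / n),
  row.names = names(n)
)
per_type
```

Here the frequent type has the larger total, while the rare type has the larger average, the same pattern the two top-10 lists show.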

Events with Greatest Economic Consequences

Now, let us take a look at the economic consequences of these events.

dmg_sum

As with harm to health, there is a single top offender, though here the runners-up are close enough that the plot is worth examining. Our decision not to group events has a visible effect: two hurricane-related event types appear separately. Floods cause the most damage, with hurricanes, tornadoes, and storm surges not far behind. Now let’s visualize the average economic consequences.
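
For illustration only, related labels could be merged so that each event still falls into exactly one group, which avoids the double counting described earlier. This is a sketch under assumptions and was not applied to the results in this report. The data frame `toy`, the column `group`, and the damage values are hypothetical.

```r
# Toy data (hypothetical damage values)
toy <- data.frame(
  EVTYPE       = c("HURRICANE", "HURRICANE/TYPHOON", "FLOOD"),
  total_damage = c(5, 7, 9),
  stringsAsFactors = FALSE
)

# Assign every label containing HURRICANE to one group; keep others as-is
toy$group <- ifelse(grepl("HURRICANE", toy$EVTYPE, fixed = TRUE),
                    "HURRICANE", toy$EVTYPE)

# Each row belongs to exactly one group, so totals are not inflated
grouped <- aggregate(total_damage ~ group, data = toy, sum)
grouped
```

A full grouping scheme would need a rule like this for every family of labels, which is why it is left out of the analysis here.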

dmg_avg

Again, there is a bit more competition in the averages. Some event types, such as wildfires, make our top 10, but they are far rarer than, say, tornadoes.

Overall recommendation

Without a clean way to bundle the data into fewer categories, we have to make some inferences based on the original event types. We want to spend our time and resources on the most frequently occurring, most harmful, and most costly events, and tornadoes clearly fit that description. We should also have a plan in place for less frequent but more severe events such as hurricanes, storm surges, and heat waves.