Tornados and Floods Rank Highest in Human and Economic Costs

Synopsis

I used a storm event data set covering 1950-2011 to determine which types of weather events cause the most damage in human and economic terms. Earlier years
in the data set contained fewer records, so I chose to use the most recent years for this analysis (see fig. 1). For the human cost question, I looked at
both the total number of fatalities and the total number of injuries. The top 5 weather events for injuries (tornado, flood, excessive heat, lightning, and tstm wind)
accounted for almost 67% of the total (see fig. 2), while the top 5 weather events for fatalities (excessive heat, tornado, flash flood, heat, and lightning)
accounted for a little over 58% of the total. The top 5 weather events for economic damage (flood, hurricane/typhoon, storm surge, tornado,
and hail) accounted for almost 73% of the total (see fig. 3).

Data Processing

To extract the data from the bzip2 file, I used the bunzip2 function from the R.utils package. Once loaded into R, I selected the columns needed
to answer the two assigned questions (which type of events are most harmful with respect to population health, and which type have the greatest economic
consequences). Specifically, I selected: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

library(R.utils)
bunzip2("./repdata_data_StormData.csv.bz2", "StormData.csv")

storm.data <- read.csv("./StormData.csv")
storm.data.sub <- storm.data[, c(2, 8, 23, 24, 25, 26, 27, 28)]
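Selecting columns by position depends on the column order of the CSV file. As a minimal sketch, the same eight columns could be selected by name instead. The toy.storm data frame and its values are made up for illustration and stand in for storm.data; they are not taken from the real file.

```r
# Hypothetical stand-in for storm.data, with two extra columns to drop
toy.storm <- data.frame(
    STATE = c("AL", "TX"),
    BGN_DATE = c("4/18/1950 0:00:00", "5/3/1999 0:00:00"),
    EVTYPE = c("TORNADO", "HAIL"),
    FATALITIES = c(0, 1),
    INJURIES = c(15, 0),
    PROPDMG = c(25, 2.5),
    PROPDMGEXP = c("K", "M"),
    CROPDMG = c(0, 10),
    CROPDMGEXP = c("", "K"),
    REMARKS = c("", ""),
    stringsAsFactors = FALSE
)

# Columns named in the text above
keep <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
    "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
toy.sub <- toy.storm[, keep]
names(toy.sub)
```

Selecting by name keeps working if columns are added or reordered upstream, at the cost of a slightly longer call.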

The earlier years in the data set had fewer records, and since I wanted to base my analysis on the most complete years, I graphed the total number of records for
each year to see if there was a natural break in the data. The total number of records jumped sharply in 1994 and rose in most years thereafter,
so I chose to use the data from 1994-2011 for the analysis.

storm.data.sub$BGN_DATE <- strptime(storm.data.sub$BGN_DATE, "%m/%d/%Y %H:%M:%S")  # Convert to POSIXlt date-time
a <- 1950:2011
b <- vector("numeric", length = 62)
year.sum <- data.frame(Year = a, Count = b)

for (i in 1:62) {
    # $year counts years since 1900, so i + 49 corresponds to 1950 through 2011
    year.sum[i, 2] <- sum(storm.data.sub$BGN_DATE$year == (i + 49))
}
plot(year.sum[, 1], year.sum[, 2], pch = 20, ylab = "Number of Data Records", 
    xlab = "Year", main = "Number of Data Records per Year")
abline(h = 20000, col = "red", lwd = 2)

Fig. 1 Number of data records per year. Note the natural break at 1994.

# select records from years containing over 20,000 records
temp <- year.sum[year.sum$Count > 20000, ]
temp <- strptime(paste(as.character(temp[1, 1]), "-01-01", sep = ""), "%Y-%m-%d")
storm.data.sub <- storm.data.sub[storm.data.sub$BGN_DATE >= temp, ]
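The per-year tally and the cutoff year can also be computed without a loop. The following is a minimal sketch on made-up dates, not the code used for this report. The toy.dates vector and the threshold of 1 are assumptions chosen only so the small example has a visible break.

```r
# Hypothetical begin dates in the same text format as BGN_DATE
toy.dates <- c("4/18/1950 0:00:00", "6/8/1993 0:00:00", "1/5/1994 0:00:00",
    "7/21/1994 0:00:00", "3/2/1995 0:00:00", "9/9/1995 0:00:00")
parsed <- strptime(toy.dates, "%m/%d/%Y %H:%M:%S", tz = "UTC")

# $year holds years since 1900, so add 1900 to get the calendar year
years <- parsed$year + 1900

# table() counts the records in each year that appears in the data
year.counts <- table(years)

# First year whose count exceeds the threshold (1 is arbitrary, for the toy data only)
threshold <- 1
first.year <- min(as.integer(names(year.counts)[year.counts > threshold]))
first.year
```

Unlike the preallocated 1950:2011 data frame, table() only reports years that occur in the data, so years with no records would be absent rather than zero.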

I split this subset of the data into two data frames: one to address the population health question (pop.harm), and another to address the economic consequences
question (econ.costs).

pop.harm <- storm.data.sub[, 2:4]
econ.costs <- storm.data.sub[, c(2, 5:8)]

The econ.costs data frame required additional preprocessing. First I kept only the rows in which property damage, crop damage, or both were greater than zero. The costs of
property damage and crop damage were each divided into two columns (four total). The first column in each set gave a cost number and the second column gave a
multiplier. The multipliers were letters referring to thousands (K), millions (M), or billions (B), so I converted them to numeric values; any other value in a
multiplier column was given a multiplier of 0. Finally, I added up the total cost of property damage and crop damage and placed the results in a new column.
The pop.harm data frame did not require this preprocessing step.

econ.costs <- econ.costs[!(econ.costs$PROPDMG == 0 & econ.costs$CROPDMG == 0), 
    ]  # Select only records with damage

##### Convert letters into numbers #####
econ.costs$PROPDMGEXP <- as.character(econ.costs$PROPDMGEXP)
econ.costs$CROPDMGEXP <- as.character(econ.costs$CROPDMGEXP)

econ.costs[, 3] <- ifelse((econ.costs$PROPDMGEXP != "K" & econ.costs$PROPDMGEXP != 
    "M" & econ.costs$PROPDMGEXP != "B"), "0", econ.costs$PROPDMGEXP)
econ.costs[, 3] <- ifelse(econ.costs$PROPDMGEXP == "K", "1000", econ.costs$PROPDMGEXP)
econ.costs[, 3] <- ifelse(econ.costs$PROPDMGEXP == "M", "1000000", econ.costs$PROPDMGEXP)
econ.costs[, 3] <- ifelse(econ.costs$PROPDMGEXP == "B", "1000000000", econ.costs$PROPDMGEXP)

econ.costs$PROPDMGEXP <- as.numeric(econ.costs$PROPDMGEXP)

econ.costs[, 5] <- ifelse(econ.costs$CROPDMGEXP != "K" & econ.costs$CROPDMGEXP != 
    "M" & econ.costs$CROPDMGEXP != "B", "0", econ.costs$CROPDMGEXP)
econ.costs[, 5] <- ifelse(econ.costs$CROPDMGEXP == "K", "1000", econ.costs$CROPDMGEXP)
econ.costs[, 5] <- ifelse(econ.costs$CROPDMGEXP == "M", "1000000", econ.costs$CROPDMGEXP)
econ.costs[, 5] <- ifelse(econ.costs$CROPDMGEXP == "B", "1000000000", econ.costs$CROPDMGEXP)

econ.costs$CROPDMGEXP <- as.numeric(econ.costs$CROPDMGEXP)

##### Create new column with total cost of damage #####
econ.costs$TOTCOST <- econ.costs$PROPDMG * econ.costs$PROPDMGEXP + econ.costs$CROPDMG * 
    econ.costs$CROPDMGEXP
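The chained ifelse calls above can be expressed more compactly as a lookup table. This is a minimal sketch under the same rule (K, M, and B are recognized and everything else becomes 0). The toy.exp and toy.dmg vectors are hypothetical values, not records from the data set.

```r
# Hypothetical exponent codes, including values outside K/M/B
toy.exp <- c("K", "M", "B", "", "k", "5")
toy.dmg <- c(25, 2.5, 1, 10, 3, 7)

# Lookup table from letter code to numeric multiplier
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# match() returns NA for codes not in the table; those get a multiplier of 0,
# which mirrors how unrecognized codes are treated in the analysis above
idx <- match(toy.exp, names(multipliers))
toy.mult <- ifelse(is.na(idx), 0, multipliers[idx])

# Cost in dollars = damage number times multiplier
toy.cost <- toy.dmg * toy.mult
toy.cost
```

Note that lowercase codes such as "k" fall through to 0 under this rule, exactly as in the original conversion; treating them as thousands would be a different analytical choice.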

Using the preprocessed data, I calculated sums for all event types and then ordered the results from highest to lowest.

library(plyr)
pop.sums <- ddply(pop.harm, "EVTYPE", function(df) sum(df[, c(2)]))
temp <- ddply(pop.harm, "EVTYPE", function(df) sum(df[, c(3)]))
pop.sums <- cbind(pop.sums, temp[, 2])
names(pop.sums) <- c("EVTYPE", "FATALITIES", "INJURIES")
pop.sums.FAT <- pop.sums[order(-pop.sums$FATALITIES), ]  # ordered by fatalities
pop.sums.INJ <- pop.sums[order(-pop.sums$INJURIES), ]  # ordered by injuries

econ.sums <- ddply(econ.costs, "EVTYPE", function(df) sum(df[, 6]))
econ.sums <- econ.sums[order(-econ.sums$V1), ]
names(econ.sums) <- c("EVTYPE", "COST")
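For comparison, the same per-event sums can be produced with base R's aggregate function in place of plyr. This is a minimal sketch on a hypothetical toy.harm data frame, not a replacement for the ddply code used in this report.

```r
# Hypothetical records with the same columns as pop.harm
toy.harm <- data.frame(
    EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
    FATALITIES = c(2, 1, 3, 4, 0),
    INJURIES = c(10, 2, 5, 1, 3),
    stringsAsFactors = FALSE
)

# Sum both outcome columns within each event type in one call
toy.sums <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy.harm, FUN = sum)

# Order from highest to lowest, once per outcome
toy.by.fat <- toy.sums[order(-toy.sums$FATALITIES), ]
toy.by.inj <- toy.sums[order(-toy.sums$INJURIES), ]
toy.by.fat
```

Summing both columns in a single call avoids the intermediate temp object and the cbind step.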

Results

For the population question, the processed data yield slightly different answers depending on whether you look at fatalities or injuries. Below are the top 5
event types for fatalities and injuries, respectively.

##             EVTYPE FATALITIES INJURIES
## 122 EXCESSIVE HEAT       1903     6525
## 785        TORNADO       1593    22571
## 145    FLASH FLOOD        951     1754
## 260           HEAT        930     2095
## 437      LIGHTNING        794     5116

##             EVTYPE FATALITIES INJURIES
## 785        TORNADO       1593    22571
## 160          FLOOD        450     6778
## 122 EXCESSIVE HEAT       1903     6525
## 437      LIGHTNING        794     5116
## 806      TSTM WIND        241     3631

The top 5 events accounted for a larger share of total injuries than of total fatalities, so I chose to plot the number of
injuries caused by the top 5 events.

(percent.fatalities <- (sum(pop.sums.FAT$FATALITIES[1:5])/sum(pop.sums.FAT$FATALITIES)) * 
    100)

## [1] 58.4

(percent.injuries <- (sum(pop.sums.INJ$INJURIES[1:5])/sum(pop.sums.INJ$INJURIES)) * 
    100)

## [1] 66.98
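The share calculation is the same for fatalities, injuries, and cost, so it could be wrapped in a small helper. The function name top.share and the toy.counts vector are hypothetical and introduced only for this sketch.

```r
# Percentage of the grand total contributed by the n largest values
top.share <- function(x, n = 5) {
    x <- sort(x, decreasing = TRUE)
    100 * sum(head(x, n)) / sum(x)
}

# Hypothetical per-event totals
toy.counts <- c(50, 20, 10, 10, 5, 3, 2)
top.share(toy.counts, n = 5)
```

Because the helper sorts its input, it does not rely on the data frame having been ordered beforehand.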

library(ggplot2)
p <- head(pop.sums.INJ[, c(1, 3)], 5)  # head() returns 6 rows by default; keep only the top 5
ggplot(p, aes(x = factor(EVTYPE), y = INJURIES)) + geom_bar(stat = "identity") + 
    labs(title = "Top 5 Weather Events to Cause Injury") + ylab("Number of Injuries") + 
    xlab("Weather Event")

Fig. 2 Top 5 weather events to cause injuries.
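A plain factor places categories in alphabetical order on the x axis, so the bars do not appear in rank order. One option, sketched here as an assumption rather than as what was done for this figure, is to order the factor levels by the plotted value with reorder before passing the column to the plot. The injury counts below are taken from the table above.

```r
# Three of the event types and injury totals reported above
toy.inj <- data.frame(
    EVTYPE = c("FLOOD", "TORNADO", "LIGHTNING"),
    INJURIES = c(6778, 22571, 5116),
    stringsAsFactors = FALSE
)

# reorder() sets factor levels by a numeric key; negating puts the largest first
toy.inj$EVTYPE <- reorder(toy.inj$EVTYPE, -toy.inj$INJURIES)
levels(toy.inj$EVTYPE)
```

With the levels set this way, the column can be used directly as the x aesthetic and the bars follow the ranking.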

For economic damage, the top 5 weather events caused almost 73% of the total economic damage.

##                EVTYPE      COST
## 67              FLOOD 1.497e+11
## 180 HURRICANE/TYPHOON 7.191e+10
## 268       STORM SURGE 4.319e+10
## 319           TORNADO 2.597e+10
## 108              HAIL 1.831e+10

(percent.cost <- (sum(econ.sums$COST[1:5])/sum(econ.sums$COST)) * 100)

## [1] 72.93

p <- head(econ.sums, 5)  # head() returns 6 rows by default; keep only the top 5
ggplot(p, aes(x = factor(EVTYPE), y = COST)) + geom_bar(stat = "identity") + 
    labs(title = "Top 5 Weather Events to Cause Economic Damage") + ylab("Cost in Dollars") + 
    xlab("Weather Event") + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, 
    hjust = 0.5))

Fig. 3 Top 5 weather events to cause economic damage.