This investigation uses the National Oceanic and Atmospheric Administration’s storm event database, which covers major US storm and weather events from 1950 to 2011. The analysis looks at two categories of consequences (health and economic) associated with specific event types and identifies the event types with the greatest impact in each category. For health consequences, the available data give fatality and injury totals by event; these outcomes are qualitatively different and inappropriate to combine into a single measure. We therefore took the top two event types by fatalities and the top two by injuries and combined the two sets. We conclude that tornadoes are by far the largest source of both fatalities and injuries. Excessive heat and thunderstorm wind are both an order of magnitude lower in injuries, and significantly lower in fatalities as well. For economic consequences, the available variables were crop damage and property damage, both in dollars, which we summed by event type to identify the event types with the highest economic impact. Flood is by far the most economically damaging weather event. Hurricane, tornado and storm surge are grouped lower, followed by a significant drop to the next level.
First, read in the data from the csv.bz2 file.
storm <- read.csv(file = "repdata_data_StormData.csv.bz2", header = TRUE)
A first look at the key variable EVTYPE shows 985 levels. Looking through them, I see a number with leading blanks, so I will strip those (leaving 977).
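To illustrate why trimming reduces the number of levels, here is a minimal sketch on a made-up vector of event labels (the values are hypothetical, not taken from the database). It uses base R's `trimws()` as a stand-in for `str_trim()`; both strip leading and trailing whitespace by default.

```r
# Hypothetical event labels; some differ only by leading blanks
ev <- c("TORNADO", " TORNADO", "FLOOD", "   FLOOD", "HAIL")

n_before <- length(unique(ev))          # distinct labels as read in
ev_trimmed <- trimws(ev)                # strip leading/trailing whitespace
n_after <- length(unique(ev_trimmed))   # distinct labels after trimming

print(c(before = n_before, after = n_after))
```

Labels that differ only by surrounding blanks collapse into one, which is the effect seen in the drop from 985 to 977 levels.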
Next, the coding of the damage amounts needs to be translated into a meaningful format. The data are divided into amount columns for crop and property damage (CROPDMG and PROPDMG) and associated exponent columns (CROPDMGEXP and PROPDMGEXP). The unique values seen for the exponents, and the multipliers applied to them, are:
Exponent code    Multiplier
B                1,000,000,000
M or m           1,000,000
K or k           1,000
H or h           100
+ or 1           1
0, 2-9, ?, -     1 (likely an incorrect value, just scale at 1)
require(stringr)
## Loading required package: stringr
storm$EVTYPE <- str_trim(storm$EVTYPE)
expcon <- c(B=1000000000,M=1000000,K=1000,H=100)
exptxt <- names(expcon)
# write one exponent cleanup function that will work on both
expclean <- function(x) {
x <- toupper(x)
ifelse(x %in% exptxt, expcon[x], 1)
}
storm$CropDmg <- expclean(storm$CROPDMGEXP) * storm$CROPDMG
storm$PropDmg <- expclean(storm$PROPDMGEXP) * storm$PROPDMG
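As a quick sanity check of the multiplier logic, the sketch below applies the same lookup to a handful of made-up exponent codes and amounts (the sample values are assumptions for illustration only). Codes that are not in the lookup table, including blanks, digits and symbols, fall back to a multiplier of 1.

```r
# Same lookup table and cleanup function as in the analysis
expcon <- c(B = 1000000000, M = 1000000, K = 1000, H = 100)
exptxt <- names(expcon)
expclean <- function(x) {
  x <- toupper(x)
  ifelse(x %in% exptxt, expcon[x], 1)
}

# Hypothetical exponent codes and damage amounts
codes   <- c("K", "m", "B", "", "5", "?", "h")
amounts <- c(25, 2.5, 1, 10, 10, 10, 3)

dollars <- expclean(codes) * amounts
print(data.frame(codes, amounts, dollars))
```

In this toy input, 25 with code "K" becomes 25,000 dollars, while 10 with code "5" stays at 10.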
Finally, we will reduce this to a tidy data set that only contains the health and economic variables that we need, at the summary level by Event Type.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tstorm <- storm %>%
  select(EVTYPE, FATALITIES, INJURIES, CropDmg, PropDmg) %>%
  group_by(EVTYPE) %>%
  # name the totals directly instead of renaming the columns afterwards
  summarize(fatalities = sum(FATALITIES),
            injuries = sum(INJURIES),
            cropdmg = sum(CropDmg),
            propdmg = sum(PropDmg)) %>%
  rename(eventtype = EVTYPE)
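The summary step amounts to a grouped sum. A minimal sketch of the same idea on a toy table follows, using base R's `aggregate()` rather than `dplyr`; the rows and values are invented for illustration.

```r
# Hypothetical event-level records
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(1, 3, 0, 2, 0),
  PropDmg    = c(1e6, 5e5, 2e6, 0, 1e3)
)

# One row per event type, with each measure summed within the group
totals <- aggregate(cbind(FATALITIES, PropDmg) ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```

Each event type ends up on a single row, which is the shape the later ranking steps rely on.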
Given that there are almost 1,000 event types, we need to set a threshold. My strategy is to take the top two event types by FATALITIES and the top two by INJURIES, recombine them, and plot them on a scatterplot.
topnum <- 2
topfatal <- tstorm %>% top_n(topnum, fatalities)
topinjur <- tstorm %>% top_n(topnum, injuries)
tophealth <- union(topfatal,topinjur)
tophealth <- tophealth %>% arrange(desc(fatalities))
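The selection logic can be seen on a toy table. The sketch below uses base R rather than `dplyr`, the helper `top_by()` is a hypothetical name introduced here, and the event names and counts are invented. It takes the top two event types by each measure and keeps the distinct ones, so an event that ranks in both lists appears only once. Unlike `top_n()`, this sketch does not keep ties.

```r
# Hypothetical per-event-type totals
toy <- data.frame(
  eventtype  = c("A", "B", "C", "D"),
  fatalities = c(100, 40, 5, 1),
  injuries   = c(900, 10, 300, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n event types with the largest value in col
top_by <- function(df, col, n) {
  df$eventtype[order(df[[col]], decreasing = TRUE)][seq_len(n)]
}

topfatal <- top_by(toy, "fatalities", 2)
topinjur <- top_by(toy, "injuries", 2)
keep <- union(topfatal, topinjur)   # distinct event types across both lists

tophealth <- toy[toy$eventtype %in% keep, ]
tophealth <- tophealth[order(-tophealth$fatalities), ]
print(tophealth)
```

Here "A" leads both rankings, so two lists of two yield three distinct event types, mirroring the overlap in the real data.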
Check whether these event types account for enough of the health effects.
percentfat <- sum(tophealth$fatalities) / sum(tstorm$fatalities)
percentinj <- sum(tophealth$injuries) / sum(tstorm$injuries)
print("Percentage of fatalities in top points")
## [1] "Percentage of fatalities in top points"
print(percentfat)
## [1] 0.5308683
print("Percentage of injuries in top points")
## [1] "Percentage of injuries in top points"
print(percentinj)
## [1] 0.7459581
Now set up the graph. I looked at descending bar charts for both fatalities and injuries; however, this was somewhat visually confusing, since the two sets of event types on the two charts were not the same. Fatalities cannot be mixed with injuries, as there is no conversion that can equate the two. A scatterplot gives a way of combining both into one graph, provided we keep the number of “top” data points small. After exploring this, I restricted the plot to the event types with the 2 highest fatality totals and the 2 highest injury totals. They overlap in one case, resulting in 3 data points. These were far enough apart that they could be clearly labelled. These 3 points represent fully 75% of the injuries from all causes and 53% of all fatalities.
library(ggplot2)
g <- ggplot(tophealth,aes(fatalities,injuries)) + expand_limits(x=0,y=0)
g <- g + xlab("Fatalities by Event Type") + ylab("Injuries by Event Type")
g <- g + labs(title="US Storm Injuries vs. Fatalities by Event Type - 1950-2011")
g <- g + geom_point(aes(color=eventtype),size=4)
g <- g + labs(color="Event Type")
print(g)
We see that tornadoes are by far the largest source of both fatalities and injuries. Excessive heat and thunderstorm wind are both an order of magnitude lower in injuries, and significantly lower in fatalities as well.
Unlike fatalities and injuries, which cannot be compared because they are qualitatively different, crop damage and property damage are already expressed in a common unit: both are in dollars. We can therefore add them to get a total damage figure per event type.
topnum <- 10
topecon <- tstorm %>% top_n(topnum, cropdmg+propdmg)
topecon$totdmg <- (topecon$cropdmg + topecon$propdmg)/1000000000
topecon <- topecon %>% select(eventtype,totdmg) %>% arrange(desc(totdmg))
par(mgp=c(3,0,1),mar=c(6.1,4.1,4.1,2.1))
barplot(topecon$totdmg,
main="Most economically harmful weather events 1950-2011",
col="red",
ylab= "Total Property and Crop Damage ($B)",
names.arg = topecon$eventtype,
cex.axis=0.6, cex.names=0.6, las=2)
Flood is by far the most economically damaging weather event. Hurricane, tornado and storm surge are grouped lower, followed by a significant drop to the next level.