In this project, we explore NOAA’s Storm Event dataset to answer two main questions: across the United States, which types of events (1) are most harmful with respect to population health, and (2) have the greatest economic consequences? Across the board, the data show tornados and flash floods to be the most harmful. While tornados, excessive heat, and lightning pose the greatest threat to humans, hail, floods, and flash floods tend to destroy the most property and crops.
The data come from NOAA’s storm event database. We use counts of fatalities and injuries to quantify harm to humans, or “population health”, and property damage and crop damage (measured in dollars) to quantify “economic consequences”. Because missing data become a greater threat to the analysis the further back in time we go, we look strictly at events from the year 2000 to present.
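## load packages used below (reshape2: melt/dcast; ggplot2: plotting)
## assumption: mutate() comes from plyr; dplyr's mutate() behaves the same here
library(reshape2)
library(plyr)
library(ggplot2)
## download file
csv <- "repdata_data_StormData.csv"
csvbz2 <- paste0(csv, ".bz2")
if (!file.exists(csvbz2)) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  ## mode = "wb" keeps the compressed file intact on Windows
  download.file(url, csvbz2, mode = "wb")
}
## load data into RAM (patience is a virtue)
## read.csv opens the connection itself and closes it when done
bz2 <- bzfile(csvbz2)
data <- read.csv(bz2)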
## pare down dataset: keep the four metrics, keep events from 2000 onward,
## and merge the abbreviated label "TSTM WIND" into "THUNDERSTORM WIND"
data <- data[, c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")]
data <- data[as.Date(data$BGN_DATE, "%m/%d/%Y") >= as.Date("2000-01-01"), ]
data$EVTYPE[data$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
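One point worth keeping in mind about the damage columns: in this dataset the dollar magnitude of `PROPDMG` and `CROPDMG` is recorded separately, in the `PROPDMGEXP` and `CROPDMGEXP` columns, which are not among the columns kept above. The sketch below shows one way those multipliers could be applied. It assumes the codes `K`, `M`, and `B` stand for thousands, millions, and billions, and it treats every other code as a multiplier of 1, which is a simplifying assumption. `to_dollars` is a hypothetical helper name and the `toy` values are made up for illustration.

```r
## hypothetical helper: convert a damage figure plus its exponent code to dollars
## assumption: K/M/B (any case) mean 1e3/1e6/1e9; every other code counts as 1
to_dollars <- function(amount, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- multipliers[toupper(as.character(exp_code))]
  m[is.na(m)] <- 1
  amount * unname(m)
}

## toy values for illustration only (not taken from the dataset)
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", ""))
toy$PROPDMG_USD <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

Applying such a conversion before summing would put every event on a common dollar scale; whether it changes the rankings reported here would need to be checked against the full data.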
Now we sum each of our four metrics by event type. The result has one row per event type, holding that type's total for each of the four metrics from 2000 to present.
metrics <- c("FATALITIES","INJURIES","PROPDMG","CROPDMG")
mData <- melt(data, id=c("EVTYPE"), measure.vars=metrics)
dData <- dcast(mData, EVTYPE ~ variable, sum)
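As a sanity check on the reshaping step, the same per-event totals can be computed with base R's `aggregate`. The sketch below runs on a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from the NOAA data) so the sums are easy to verify by hand.

```r
## toy data for illustration only
toy <- data.frame(
  EVTYPE     = c("HAIL", "TORNADO", "HAIL", "TORNADO", "FLOOD"),
  FATALITIES = c(0, 2, 0, 1, 1),
  INJURIES   = c(1, 10, 0, 5, 0),
  PROPDMG    = c(5, 50, 10, 20, 30),
  CROPDMG    = c(20, 0, 15, 0, 5)
)

## sum every metric within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EVTYPE,
                    data = toy, FUN = sum)
print(totals)
```

Note that the formula interface of `aggregate` drops rows containing missing values by default, so on real data the two approaches agree only when the metric columns are complete.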
For each metric, we rank the event types to see how each stacks up against the others. We will assume that within each metric we are comparing apples to apples, even though the events differ in character: a fatality caused by a tornado or lightning strike is much more random and localized than one caused by, say, a drought or heat wave. Nonetheless, when harm is expressed as a percentage, we can say that of all the fatalities represented in the dataset, event type A accounts for a higher or lower share than event type B.
## calculate rankings (1 = largest total)
dData <- mutate(dData[order(dData$FATALITIES, decreasing = TRUE), ], frank = seq_len(nrow(dData)))
dData <- mutate(dData[order(dData$INJURIES, decreasing = TRUE), ], irank = seq_len(nrow(dData)))
dData <- mutate(dData[order(dData$PROPDMG, decreasing = TRUE), ], prank = seq_len(nrow(dData)))
dData <- mutate(dData[order(dData$CROPDMG, decreasing = TRUE), ], crank = seq_len(nrow(dData)))
## calculate percentages
dData <- mutate(dData, fpct = 100 * FATALITIES/sum(dData$FATALITIES))
dData <- mutate(dData, ipct = 100 * INJURIES/sum(dData$INJURIES))
dData <- mutate(dData, ppct = 100 * PROPDMG/sum(dData$PROPDMG))
dData <- mutate(dData, cpct = 100 * CROPDMG/sum(dData$CROPDMG))
## focus on any event type that makes the top 5 (rank < 6) in any of the four categories
## (turns out there are nine of these)
top9 <- dData[dData$frank < 6 | dData$irank < 6 | dData$prank < 6 | dData$crank < 6,]$EVTYPE
## melt the data for easy plotting
mdData <- melt(dData[dData$EVTYPE %in% top9,], id = c("EVTYPE"), measure.vars = c("fpct", "ipct", "ppct", "cpct"))
## create plot
ggplot(mdData, aes(x = variable, y = value, col = EVTYPE)) +
  ggtitle("Storm Damage by Type") +
  xlab("Metric (fatalities, injuries, property, crops)") + ylab("% Share") +
  geom_point(size = 5) +
  scale_colour_brewer(palette = "Set1")
From this graphic, we can draw some subjective conclusions.
- Tornados look to be the “most harmful” storm event type overall. They account for 20% of storm-related fatalities in the NOAA dataset from year 2000 to present, and a whopping 43% of injuries. However, they are eclipsed by other kinds of storms in terms of economic damage.
- The top two storm-related causes of death are tornados and excessive heat, the former beating the latter by only a few percentage points.
- Tornados cause more than five times as many injuries as the runner-up, excessive heat. One can infer, then, that while tornados injure many more people than they kill, excessive heat is less merciful, killing a greater proportion of the people it affects.
- Thunderstorm winds account for nearly 30% of storm-related property damage, followed by flash floods and tornados.
- Hail accounts for 40% of storm-related crop damage, more than double the share of either of its runners-up, flash floods and thunderstorm winds. However, it is interesting to note that injuries and fatalities caused by hail are negligible.
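The inference that excessive heat kills a greater proportion of the people it affects can be checked directly by computing, for each event type, the fraction of recorded casualties (fatalities plus injuries) that were fatalities. The sketch below assumes a data frame shaped like `dData`; the `toy` values are made up for illustration and are not results from this analysis, and `fatal_share` is a hypothetical column name.

```r
## toy stand-in for dData; values are made up, not results from the analysis
toy <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "HAIL"),
  FATALITIES = c(100, 80, 0),
  INJURIES   = c(900, 120, 10)
)

## share of casualties that were fatal (NA when there were no casualties)
casualties <- toy$FATALITIES + toy$INJURIES
toy$fatal_share <- ifelse(casualties > 0, toy$FATALITIES / casualties, NA)

## most lethal per casualty first
toy <- toy[order(toy$fatal_share, decreasing = TRUE), ]
print(toy)
```

Running the same calculation on `dData` would show whether the pattern suggested by the plot holds numerically.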