In this article we’ll use the NOAA storm database to determine which types of storm are the most harmful to population health, and which have the greatest economic consequences. The storm database contains measurements of human fatalities and injuries, as well damage to property and crops categorized by storm type. We will focus on identifying the storm types that have the greatest relative impact and their frequency, so that resources can be allocated in an efficient manner.
We’ll be using data.table instead of data.frame, reshape2 to help with aggregation, ggplot2 for graphics, knitr for processing R Markdown to HTML, xtable for HTML table formatting and the R.utils package to help with file decompression. Here we set global knitr options and allow printing of larger numerics in fixed format and set the working directory.
suppressMessages({
library(data.table)
library(reshape2)
library(ggplot2)
library(knitr)
library(xtable)
library(R.utils)
})
opts_knit$set(progress=FALSE)
opts_chunk$set(echo=TRUE, message=FALSE, tidy=TRUE, comment=NA,
fig.path="figure/", fig.keep="high", fig.width=10, fig.height=6,
fig.align="center")
options(scipen=10, stringsAsFactors=FALSE)
setwd('~/R_projects/jhuds/c5p2')
The data used in this analysis was originally sourced from the U.S. Dept. of Commerce, National Oceanic & Atmospheric Administration (NOAA), National Weather Service. The specific version of the storm database we’ll be using is:
Department of Commerce / National Oceanic & Atmospheric Administration / National Weather Service NATIONAL WEATHER SERVICE INSTRUCTION 10-1605 AUGUST 17, 2007 Operations and Services Performance, NWSPD 10-16 STORM DATA PREPARATION
This document is availble available at http://www.nws.noaa.gov/directives/
If necessary, download and unzip the storm database. Read decompressed file into a data.table.
csvFile <- file.path(getwd(), "StormData.csv")
bz2File <- paste0(csvFile, ".bz2")
if (!file.exists(csvFile)) {
if (!file.exists(bz2File)) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
bz2File)
}
bunzip2(bz2File)
}
data <- as.data.table(read.csv(csvFile))
From the storm database documentation the following items have been determined to be relevant to this analysis:
Create a subset of these columns for both the health & ecomonic analysis. Since there is mixed case in the character fields, convert them to uppercase to help with aggregation.
ads <- data[, list(EVTYPE = toupper(EVTYPE), FATALITIES, INJURIES, PROPDMG,
PROPDMGEXP = toupper(PROPDMGEXP), CROPDMG, CROPDMGEXP = toupper(CROPDMGEXP))]
There are 0 NA values in this data set.
Find the event types that have recorded adverse health impacts, then subset the data to just those event types.
health_impact_events <- unique(ads[FATALITIES > 0 | INJURIES > 0, EVTYPE])
health_impact_data <- ads[EVTYPE %in% health_impact_events, list(EVTYPE = as.factor(EVTYPE),
FATALITIES, INJURIES)]
Find the event types that have recorded adverse economic impacts, then subset the data to just those event types.
econ_impact_events <- unique(ads[PROPDMG > 0 | CROPDMG > 0, EVTYPE])
econ_impact_data <- ads[EVTYPE %in% econ_impact_events, list(EVTYPE = as.factor(EVTYPE),
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)]
Only two significant digits for the crop and property damage measurements are stored, so we’ll need to convert these to a common scale using the corresponding exponent. Display a frequency table of all exponents used. Set non-numeric exponents to their numeric equivalents, and inspect converted values.
sort(table(c(econ_impact_data$PROPDMGEXP, econ_impact_data$CROPDMGEXP)), decreasing = TRUE)
K M 0 B 5 1 ? 2
1080977 706403 13332 235 49 27 25 15 14
H + 7 3 4 6 - 8
7 5 5 4 4 4 1 1
econ_impact_data <- econ_impact_data[PROPDMGEXP %in% c("", "?", "+", "-"), `:=`(PROPDMGEXP,
"0")]
econ_impact_data <- econ_impact_data[CROPDMGEXP %in% c("", "?", "+", "-"), `:=`(CROPDMGEXP,
"0")]
econ_impact_data <- econ_impact_data[PROPDMGEXP == "H", `:=`(PROPDMGEXP, "2")]
econ_impact_data <- econ_impact_data[CROPDMGEXP == "H", `:=`(CROPDMGEXP, "2")]
econ_impact_data <- econ_impact_data[PROPDMGEXP == "K", `:=`(PROPDMGEXP, "3")]
econ_impact_data <- econ_impact_data[CROPDMGEXP == "K", `:=`(CROPDMGEXP, "3")]
econ_impact_data <- econ_impact_data[PROPDMGEXP == "M", `:=`(PROPDMGEXP, "6")]
econ_impact_data <- econ_impact_data[CROPDMGEXP == "M", `:=`(CROPDMGEXP, "6")]
econ_impact_data <- econ_impact_data <- econ_impact_data[PROPDMGEXP == "B",
`:=`(PROPDMGEXP, "9")]
econ_impact_data <- econ_impact_data[CROPDMGEXP == "B", `:=`(CROPDMGEXP, "9")]
sort(table(c(econ_impact_data$PROPDMGEXP, econ_impact_data$CROPDMGEXP)), decreasing = TRUE)
0 3 6 9 5 1 2 7 4
1081233 706407 13336 49 27 25 21 5 4
8
1
Convert the exponet columns to numeric and then use the exponent to convert the measurements to a common scale
econ_impact_data <- econ_impact_data[, `:=`(PROPDMGEXP, as.numeric(PROPDMGEXP))]
econ_impact_data <- econ_impact_data[, `:=`(CROPDMGEXP, as.numeric(CROPDMGEXP))]
econ_impact_data <- econ_impact_data[, `:=`(PROPDMGSCALED, (PROPDMG * 10^PROPDMGEXP)/10^6)]
econ_impact_data <- econ_impact_data[, `:=`(CROPDMGSCALED, (CROPDMG * 10^CROPDMGEXP)/10^6)]
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? Compute total and average fatalities and injuries by event type. Plot the top 10 storm event types having biggest impact on health in terms of average number of fatalities or injuries.
agg_health <- health_impact_data[, list(n = as.double(.N), tot_fatalities = as.double(sum(FATALITIES)),
tot_injuries = as.double(sum(INJURIES)), avg_fatalities = sum(FATALITIES)/.N,
avg_injuries = sum(INJURIES)/.N), by = EVTYPE]
top <- 10
agg_health_top <- agg_health[order(avg_fatalities, decreasing = TRUE), ][1:top]
agg_health_top <- rbind(agg_health_top, agg_health[order(avg_injuries, decreasing = TRUE),
][1:top])
agg_health_melt <- melt(agg_health_top, id = "EVTYPE", measure.vars = c("avg_fatalities",
"avg_injuries"))
gh <- ggplot(data = agg_health_melt, aes(x = reorder(EVTYPE, value), y = value))
gh <- gh + geom_bar(stat = "identity") + coord_flip()
gh <- gh + facet_grid(~variable)
gh <- gh + ggtitle("Top Storm Events with respect to population health")
gh <- gh + xlab("Storm Event Type")
gh + ylab("Average impact (number of health casualties)")
On average, we can see that tornadoes, thunderstorm wind, hail, and tropical storms are the most deadly storm events. Tropical storms, high wind and seas, and wild fires cause the most injuries. All of these have wind as a component or are exacerbated by wind.
Across the United States, which types of events have the greatest economic consequences? Compute total and average property & crop damage by event type. Plot the top 10 storm event types having biggest economic impact in terms of the average cost of property and crop damage.
agg_econ <- econ_impact_data[, list(n = as.double(.N), tot_prop_damage = as.double(sum(PROPDMGSCALED)),
tot_crop_damage = as.double(sum(CROPDMGSCALED)), avg_prop_damage = sum(PROPDMGSCALED)/.N,
avg_crop_damage = sum(CROPDMGSCALED)/.N), by = EVTYPE]
top <- 10
agg_econ_top <- agg_econ[order(avg_prop_damage, decreasing = TRUE), ][1:top]
agg_econ_top <- rbind(agg_econ_top, agg_econ[order(avg_crop_damage, decreasing = TRUE),
][1:top])
agg_econ_melt <- melt(agg_econ_top, id = "EVTYPE", measure.vars = c("avg_prop_damage",
"avg_crop_damage"))
ge <- ggplot(data = agg_econ_melt, aes(x = reorder(EVTYPE, value), y = value))
ge <- ge + geom_bar(stat = "identity") + coord_flip()
ge <- ge + facet_grid(~variable)
ge <- ge + ggtitle("Top Storm Events with respect to Economic Health Impact")
ge <- ge + xlab("Storm Event Type")
ge + ylab("Average impact (milions $ US)")
On average, we can see that tornadoes, heavy rain / severe weather, and hurricanes / typhoons cause the most property damage. The cost of crop damage from these events is much less than property damage. Excessive wetness and hurricanes / typhoons cause the most crop damage. All of these have wind as a component or are exacerbated by wind.
Finally, we’ll take a look a overall storm type frequency, by combining the frequency figures from the health and economic impact sumamries.
event_freq <- rbind(agg_health[, list(EVTYPE, n)], agg_econ[, list(EVTYPE, n)])
agg_event <- event_freq[, list(n = sum(n)), by = EVTYPE]
agg_event_top <- agg_event[order(n, decreasing = TRUE), ][1:20]
gf <- ggplot(data = agg_event_top, aes(x = reorder(EVTYPE, n), y = n))
gf <- gf + geom_bar(stat = "identity") + coord_flip()
gf <- gf + ggtitle("Top 20 Storm Event Types")
gf <- gf + xlab("Storm Event Type")
gf + ylab("Number of Storms")
Here we can see that hail & thunderstom wind occur far more frequently than other types of storms, followed by tornadoes and flash floods. Of these high-frequency storm types, we notice that the top three also had very high average impact in terms population health and damage costs.