We wish to determine which storm event type a) is the most harmful to population health, measured here as the number of injuries and fatalities, and b) has the greatest economic consequences, measured here as the estimated USD
value of property and crop damage. To do so, we examine the NOAA Storm Database, a data set of storm events across the United States of America, with data collected from 1950 to 2011.
Tornadoes cause the greatest number of fatalities and injuries (96,979) and also cost the most in combined estimated property and crop damage (USD 3,312,277).
We have used the NOAA Storm database, which can be downloaded as a compressed CSV file from here. The data is described in this document, and this FAQ.
Our first step is to make sure that we’ve downloaded the data and then read it in.
if (!file.exists('StormData.csv.bz2')) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "StormData.csv.bz2",
                method = "curl")
}
data <- read.csv('StormData.csv.bz2')
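No explicit decompression step is needed before reading the file. As a small illustration, the sketch below writes a toy data frame to a bzip2-compressed CSV in a temporary file and reads it back with read.csv. The toy columns and values are made up for this example and are not taken from the Storm Database.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(EVTYPE = c("TORNADO", "hail"), FATALITIES = c(1, 0))

# Write the toy data to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through file(), which detects the compression
back <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

back
```

The same mechanism is what lets the single read.csv call above work directly on StormData.csv.bz2.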
Our analysis uses the following five variables: EVTYPE, FATALITIES, INJURIES, PROPDMG, and CROPDMG.
The EVTYPE field is very dirty. It consists of a collection of strings that sometimes differ from each other only by case. We limit our cleaning to two steps: converting every value to upper case, and replacing the one unknown value (represented by ‘?’) with NA.
library(dplyr)
data <- data %>%
mutate(EVTYPE = factor(toupper(EVTYPE)))
data$EVTYPE[which(data$EVTYPE == '?')] <- NA
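To make the effect of these two cleaning steps concrete, here is a minimal base R sketch on a made-up vector of event type strings (the values are hypothetical, chosen only to show case variants and the ‘?’ placeholder).

```r
# Hypothetical raw values: three spellings of one event, an unknown, one other
evtype <- c("Tornado", "TORNADO", "tornado", "?", "Hail")

clean <- toupper(evtype)     # collapse case variants
clean[clean == "?"] <- NA    # mark the unknown value as missing
clean <- factor(clean)       # NA is not kept as a level

levels(clean)   # "HAIL" "TORNADO"
table(clean)    # HAIL: 1, TORNADO: 3
```

Variants that differ by more than case (plurals, abbreviations, stray whitespace) are not merged by this step.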
Two variables describe population health: FATALITIES and INJURIES. We first introduce a new variable, total.health, that is the sum of FATALITIES and INJURIES (and, for the economic analysis further on, damage, the sum of PROPDMG and CROPDMG). We then calculate the sum and mean of the health variables per event type.
data <- data %>%
  mutate(total.health = INJURIES + FATALITIES,
         damage = PROPDMG + CROPDMG)
health.summary <- data %>%
  group_by(EVTYPE) %>%
  summarise(fatality.sum = sum(FATALITIES), fatality.mean = mean(FATALITIES),
            injury.sum = sum(INJURIES), injury.mean = mean(INJURIES),
            health.sum = sum(total.health), health.mean = mean(total.health)) %>%
  arrange(desc(health.sum), desc(health.mean))
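The difference between the per-type sum and the per-type mean matters for the results later on, so here is a small base R sketch of the same grouping logic on made-up data. The three rows are hypothetical, and tapply is used here only so the example runs without extra packages; it is not the code used for the analysis.

```r
# Hypothetical records: two tornado events (one harmless) and one heat event
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(10, 0, 4)
)
toy$total.health <- toy$FATALITIES + toy$INJURIES

# Sum and mean of total.health per event type
sums  <- tapply(toy$total.health, toy$EVTYPE, sum)
means <- tapply(toy$total.health, toy$EVTYPE, mean)

toy.summary <- data.frame(
  EVTYPE      = names(sums),
  health.sum  = as.vector(sums),
  health.mean = as.vector(means)
)
# Largest total first, as in health.summary
toy.summary <- toy.summary[order(-toy.summary$health.sum), ]
toy.summary
```

In this toy data TORNADO has the larger sum (11 against 6) but the smaller mean (5.5 against 6), because the mean is taken over every recorded event, including those with no casualties.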
For plotting purposes, we also want to produce a cleaner version of this data, in which the kind of health consequence (injury or fatality) is identified by a new variable and the two aggregates (mean and sum) each have a column of their own:
library(tidyr)
health.clean <- health.summary %>%
  # We only want to look at those events with more than 500 injuries and fatalities
  filter(health.sum > 500) %>%
  # Gather the four per-type summary columns into key/value pairs (one row each)
  gather(variable, value, c(injury.sum, fatality.sum, injury.mean, fatality.mean)) %>%
  # Split the key into the type of health consequence and the aggregate
  separate(variable, into = c("health.type", "aggregate"), sep = "\\.") %>%
  # Spread the aggregate back out so that sum and mean are columns of their own
  spread(aggregate, value)
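The gather / separate / spread chain can be hard to follow on first reading. As a hedged alternative sketch, tidyr's pivot_longer can produce the same shape in one step by using the special ".value" name, assuming a tidyr version that provides pivot_longer (1.0.0 or later). The toy data frame below is made up for illustration.

```r
library(tidyr)

# Hypothetical summary with the same column naming scheme as health.summary
toy <- data.frame(
  EVTYPE        = c("TORNADO", "HEAT"),
  injury.sum    = c(10, 4),
  fatality.sum  = c(1, 2),
  injury.mean   = c(5, 4),
  fatality.mean = c(0.5, 2)
)

# "injury.sum" is split at the dot: the first part goes to health.type,
# the second part (".value") becomes the name of an output column
long <- pivot_longer(
  toy,
  cols      = -EVTYPE,
  names_to  = c("health.type", ".value"),
  names_sep = "\\."
)
long
```

The result has one row per event type and health type, with sum and mean as separate columns, which is the layout the stacked bar charts need.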
And this is what the clean health data looks like:
knitr::kable(head(health.clean %>% arrange(desc(health.sum), EVTYPE), n = 10),
digits = 2)
| EVTYPE | health.sum | health.mean | health.type | mean | sum |
|---|---|---|---|---|---|
| TORNADO | 96979 | 1.60 | fatality | 0.09 | 5633 |
| TORNADO | 96979 | 1.60 | injury | 1.51 | 91346 |
| EXCESSIVE HEAT | 8428 | 5.02 | fatality | 1.13 | 1903 |
| EXCESSIVE HEAT | 8428 | 5.02 | injury | 3.89 | 6525 |
| TSTM WIND | 7461 | 0.03 | fatality | 0.00 | 504 |
| TSTM WIND | 7461 | 0.03 | injury | 0.03 | 6957 |
| FLOOD | 7259 | 0.29 | fatality | 0.02 | 470 |
| FLOOD | 7259 | 0.29 | injury | 0.27 | 6789 |
| LIGHTNING | 6046 | 0.38 | fatality | 0.05 | 816 |
| LIGHTNING | 6046 | 0.38 | injury | 0.33 | 5230 |
On the economic side, we’re mostly interested in the total economic consequence, which we calculate as the sum of the property damage and crop damage. Below, we calculate this value for each event type.
economy.summary <- data %>%
  group_by(EVTYPE) %>%
  # damage is the sum of property damage and crop damage
  summarise(total.damage = sum(damage), mean = mean(damage)) %>%
  arrange(desc(total.damage), desc(mean))
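One caveat about the damage variable: it adds the raw PROPDMG and CROPDMG values as recorded. The Storm Database also carries the magnitude columns PROPDMGEXP and CROPDMGEXP, which this analysis does not use, so the totals are not scaled by those codes. The sketch below shows one way such scaling could be done on made-up rows. K, M and B are the documented codes for thousands, millions and billions; treating every other code as a multiplier of 1 is an assumption of this sketch, and exp_multiplier is a hypothetical helper, not part of the analysis.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: any code other than K, M or B (blank, digits, symbols) means 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up rows, for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 1, 0),
  CROPDMGEXP = c("", "k", ""),
  stringsAsFactors = FALSE
)

toy$damage.usd <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
toy$damage.usd   # 25000 2501000 1
```

Applying a scaling like this to the full data set could change both the totals and the ranking of event types, so the unscaled figures reported here should be read with that in mind.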
Below we show the total number of health consequences for the storm event types that have accumulated more than 500 such consequences. We differentiate between injuries and fatalities by using a stacked bar graph.
library(ggplot2)
library(scales)
ggplot(health.clean, aes(x = EVTYPE, y = sum, fill = health.type)) +
  geom_bar(stat = "identity") +
  theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Total number of health consequences for storm events",
       x = "Storm event",
       y = "Total number of health consequences (injuries + fatalities)")
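Because EVTYPE is a factor, the bars in this chart appear in the order of its levels, which is alphabetical by default. If bars sorted by size were preferred, one option is to reorder the factor levels before plotting. The sketch below shows the idea in base R on three values taken from the summary table; it is a possible variation and not what the chart above does.

```r
# Three event types with their total health consequences
toy <- data.frame(
  EVTYPE     = factor(c("FLOOD", "TORNADO", "LIGHTNING")),
  health.sum = c(7259, 96979, 6046)
)

levels(toy$EVTYPE)   # alphabetical: "FLOOD" "LIGHTNING" "TORNADO"

# reorder() sorts levels by a score in ascending order; negating the
# totals therefore puts the largest total first
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$health.sum)

levels(toy$EVTYPE)   # "TORNADO" "FLOOD" "LIGHTNING"
```

In a ggplot call the same expression can be placed inside aes, for example x = reorder(EVTYPE, -health.sum).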
Notice that the TORNADO event has by far the greatest number of health consequences in absolute terms. Tornadoes account for 96,979 injuries and fatalities, while the storm event type with the second largest number of health consequences, EXCESSIVE HEAT, has an order of magnitude fewer: 8,428.
Here are the ten events with the most health consequences:
knitr::kable(head(health.summary %>% select(EVTYPE, health.sum), n = 10))
| EVTYPE | health.sum |
|---|---|
| TORNADO | 96979 |
| EXCESSIVE HEAT | 8428 |
| TSTM WIND | 7461 |
| FLOOD | 7259 |
| LIGHTNING | 6046 |
| HEAT | 3037 |
| FLASH FLOOD | 2755 |
| ICE STORM | 2064 |
| THUNDERSTORM WIND | 1621 |
| WINTER STORM | 1527 |
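The "order of magnitude" comparison can be checked directly from the table above. The short sketch below re-enters the ten totals by hand and computes two ratios; the vector name health.top is introduced only for this example.

```r
# Totals copied from the top-ten table above
health.top <- c(
  "TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
  "FLOOD" = 7259, "LIGHTNING" = 6046, "HEAT" = 3037,
  "FLASH FLOOD" = 2755, "ICE STORM" = 2064,
  "THUNDERSTORM WIND" = 1621, "WINTER STORM" = 1527
)

# How many times larger is the tornado total than the runner-up?
ratio <- health.top[["TORNADO"]] / health.top[["EXCESSIVE HEAT"]]

# What share of the top-ten total is due to tornadoes?
share <- health.top[["TORNADO"]] / sum(health.top)

round(ratio, 1)   # 11.5
round(share, 3)   # 0.707
```

Tornadoes therefore have roughly 11.5 times the health consequences of EXCESSIVE HEAT and make up about 71% of the total across these ten event types.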
In contrast, tornadoes do not have a high number of injuries and fatalities per event. That distinction goes to hurricanes and typhoons, as can be seen in the chart below, which shows the average number of health consequences per event for the event types that have had more than 500 health consequences in total.
ggplot(health.clean, aes(x = EVTYPE, y = mean, fill = health.type)) +
  geom_bar(stat = "identity") +
  theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Average number of health consequences for storm events",
       x = "Storm event",
       y = "Average number of health consequences (injuries + fatalities)")
Here are the event types with the highest means:
knitr::kable(head(health.summary %>%
                    filter(health.sum > 500) %>%
                    arrange(desc(health.mean)), n = 10),
             digits = 2)
| EVTYPE | fatality.sum | fatality.mean | injury.sum | injury.mean | health.sum | health.mean |
|---|---|---|---|---|---|---|
| HURRICANE/TYPHOON | 64 | 0.73 | 1275 | 14.49 | 1339 | 15.22 |
| HEAT WAVE | 172 | 2.29 | 379 | 5.05 | 551 | 7.35 |
| EXCESSIVE HEAT | 1903 | 1.13 | 6525 | 3.89 | 8428 | 5.02 |
| HEAT | 937 | 1.22 | 2100 | 2.74 | 3037 | 3.96 |
| RIP CURRENTS | 204 | 0.67 | 297 | 0.98 | 501 | 1.65 |
| TORNADO | 5633 | 0.09 | 91346 | 1.51 | 96979 | 1.60 |
| FOG | 62 | 0.12 | 734 | 1.36 | 796 | 1.48 |
| RIP CURRENT | 368 | 0.78 | 232 | 0.49 | 600 | 1.28 |
| ICE STORM | 89 | 0.04 | 1975 | 0.98 | 2064 | 1.03 |
| LIGHTNING | 816 | 0.05 | 5230 | 0.33 | 6046 | 0.38 |
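The table above lists RIP CURRENTS and RIP CURRENT as separate event types, and the earlier tables do the same for TSTM WIND and THUNDERSTORM WIND. This is a consequence of cleaning EVTYPE by case only. As a hedged sketch of what further cleaning might look like, the hypothetical helper below merges a few such variants with regular expressions. The three rules are assumptions chosen for this example and would need checking against the full list of event types before being applied.

```r
# Hypothetical helper: merge a few obvious spelling variants of event types.
# The rules below are illustrative assumptions, not an exhaustive mapping.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))                # case and stray whitespace
  x <- sub("^TSTM", "THUNDERSTORM", x)   # expand the TSTM abbreviation
  x <- sub("WINDS$", "WIND", x)          # plural to singular
  x <- sub("CURRENTS$", "CURRENT", x)    # plural to singular
  x
}

normalise_evtype(c("TSTM WIND", "Thunderstorm Winds",
                   "RIP CURRENTS", " rip current"))
# "THUNDERSTORM WIND" "THUNDERSTORM WIND" "RIP CURRENT" "RIP CURRENT"
```

Using the totals already reported, merging the two rip current entries would give 600 + 501 = 1,101 health consequences, and merging TSTM WIND with THUNDERSTORM WIND would give 7,461 + 1,621 = 9,082, which is more than the 8,428 recorded for EXCESSIVE HEAT.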
Below we plot the storm event types whose combined estimated damage to property and crops exceeds $20,000.
ggplot(filter(economy.summary, total.damage > 20000),
       aes(x = EVTYPE, y = total.damage)) +
  geom_bar(stat = "identity", fill = "grey") +
  theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Total economic damage in US dollars",
       x = "Storm event",
       y = "Damage in US dollars (property + crop damage)")
Notice that tornadoes have the greatest economic consequences, causing an estimated total of $3,312,277 of damage. The top ten storm events by economic consequence are:
knitr::kable(head(economy.summary %>%
                    mutate(total.damage = paste0('$', comma(total.damage))),
                  n = 10),
             digits = 2)
| EVTYPE | total.damage | mean |
|---|---|---|
| TORNADO | $3,312,276.68 | 54.61 |
| FLASH FLOOD | $1,599,325.05 | 29.47 |
| TSTM WIND | $1,445,198.21 | 6.57 |
| HAIL | $1,268,289.66 | 4.39 |
| FLOOD | $1,067,976.36 | 42.17 |
| THUNDERSTORM WIND | $943,635.62 | 11.43 |
| LIGHTNING | $606,932.39 | 38.53 |
| THUNDERSTORM WINDS | $464,978.11 | 22.31 |
| HIGH WIND | $342,014.77 | 16.92 |
| WINTER STORM | $134,699.58 | 11.78 |
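The dollar formatting in the table above comes from comma in the scales package. For readers without that package, the sketch below shows a base R way to get the same style of label with formatC. The helper name fmt_usd is hypothetical and is not used anywhere in the analysis.

```r
# Hypothetical helper: format a number as a dollar amount with
# thousands separators and two decimal places
fmt_usd <- function(x) {
  paste0("$", formatC(x, format = "f", digits = 2, big.mark = ","))
}

fmt_usd(c(3312276.68, 134699.58))
# "$3,312,276.68" "$134,699.58"
```

This changes only how the numbers are displayed, not how the underlying totals are calculated.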