In this report, we analyze the health and economic consequences of severe weather events using the NOAA Storm Database. We find that the event type labeled “TORNADOES, TSTM WIND, HAIL” (tornadoes, thunderstorm wind, and hail) has both the highest average number of fatalities per event and the highest average economic damage per event of any event type in the database.
First we read the data in.
library(tidyverse)  # provides read_csv, the dplyr verbs, str_to_lower, fct_reorder, and ggplot
data <- read_csv("repdata_data_StormData.csv.bz2")
## Parsed with column specification:
## cols(
## .default = col_character(),
## STATE__ = col_double(),
## COUNTY = col_double(),
## BGN_RANGE = col_double(),
## COUNTY_END = col_double(),
## END_RANGE = col_double(),
## LENGTH = col_double(),
## WIDTH = col_double(),
## F = col_integer(),
## MAG = col_double(),
## FATALITIES = col_double(),
## INJURIES = col_double(),
## PROPDMG = col_double(),
## CROPDMG = col_double(),
## LATITUDE = col_double(),
## LONGITUDE = col_double(),
## LATITUDE_E = col_double(),
## LONGITUDE_ = col_double(),
## REFNUM = col_double()
## )
## See spec(...) for full column specifications.
One thing to note is that the event type labels (EVTYPE) in this data set are very inconsistent. Ideally, we would standardize them so that we could see clearly which types of severe weather events have large impacts, but I did not find an easy way to do that. Based on what I saw, fixing the labels would require manually classifying hundreds of event types, plus additional research into what some labels even mean. For instance, I could replace every label containing “tstm” or “thunder” with a blanket “thunderstorm” label, but some labels appear to be aggregates, e.g. “TORNADOES, TSTM WIND, HAIL”, so I could easily count events twice. There is also a whole set of labels for month summaries, and there are many misspellings that are hard to predict and could probably only be handled fully by hand. Rather than spend a week on this, or make a half-hearted attempt at cleaning, I leave the labels as given; the results of the following analyses should be read with that caveat in mind.
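The double-counting problem with a blanket “thunderstorm” label can be seen in a small sketch. The label vector below is a hand-picked illustration, not a sample drawn from the data; only “TORNADOES, TSTM WIND, HAIL” is quoted from the discussion above, and the misspelled and summary labels are made up to stand in for the kinds of labels described.

```r
# Illustrative labels only (assumption): the real EVTYPE column has hundreds of values.
labels <- c("TSTM WIND", "THUNDERSTORM WINDS", "TORNADOES, TSTM WIND, HAIL",
            "HAIL", "THUDERSTORM WINDS", "Summary: March")

# Case-insensitive match on "tstm" or "thunder"
is_tstm <- grepl("tstm|thunder", labels, ignore.case = TRUE)
labels[is_tstm]

# Matched labels that also name another hazard: relabeling these as
# "thunderstorm" would hide (or double count) the tornado and hail parts.
also_other <- is_tstm & grepl("tornado|hail", labels, ignore.case = TRUE)
labels[also_other]

# The misspelled label is not caught by the pattern at all.
labels[!is_tstm]
```

A pattern like this catches the obvious variants, pulls in the aggregate label, and misses the misspelling, which is why a reliable cleanup would need label-by-label review.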
A preprocessing step we can do, however, is to create a new variable that combines the information from PROPDMG and PROPDMGEXP into a single number, where PROPDMGEXP is a letter code for the multiplier (h, k, m, or b for hundred, thousand, million, or billion). We will do the same for CROPDMG and CROPDMGEXP. Then we can sum the two types of damage into the total damage. Finally, we can group the data by event type and summarize each damage measure by its mean.
# Map a damage-exponent code ("h", "k", "m", "b") to its numeric multiplier.
# Unrecognized codes map to 0, as in a plain loop over the four cases.
convert <- function(s) {
  multipliers <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  x <- unname(multipliers[s])  # vectorized lookup by name
  x[is.na(x)] <- 0
  x
}
econ_data <- data %>%
  # Normalize the exponent codes to lower case
  mutate(PROPDMGEXP = str_to_lower(PROPDMGEXP), CROPDMGEXP = str_to_lower(CROPDMGEXP)) %>%
  # Keep only rows where both exponent codes are recognized
  filter(PROPDMGEXP %in% c("h", "k", "m", "b"), CROPDMGEXP %in% c("h", "k", "m", "b")) %>%
  # Apply the multipliers, then add property and crop damage
  mutate(prop_dmg = PROPDMG * convert(PROPDMGEXP), crop_dmg = CROPDMG * convert(CROPDMGEXP)) %>%
  mutate(total_dmg = prop_dmg + crop_dmg) %>%
  # Average each damage measure within event type
  group_by(EVTYPE) %>%
  summarize(total_dmg = mean(total_dmg), prop_dmg = mean(prop_dmg), crop_dmg = mean(crop_dmg)) %>%
  arrange(desc(total_dmg))
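To check the arithmetic of this step, here is a minimal sketch on made-up records (the `toy` data frame and its values are assumptions for illustration, not rows from the NOAA data). It applies the same idea: look up a multiplier for each exponent code, drop rows whose codes are not recognized, and average total damage by event type.

```r
library(dplyr)

# Made-up records for illustration only
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "HAIL"),
  PROPDMG    = c(25, 1.5, 2, 3),
  PROPDMGEXP = c("k", "m", "k", "k"),
  CROPDMG    = c(0, 10, 5, 0),
  CROPDMGEXP = c("k", "k", "h", NA),   # the NA row is dropped by the filter
  stringsAsFactors = FALSE
)

multiplier <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
valid <- names(multiplier)

toy_summary <- toy %>%
  filter(PROPDMGEXP %in% valid, CROPDMGEXP %in% valid) %>%
  mutate(total_dmg = PROPDMG * unname(multiplier[PROPDMGEXP]) +
                     CROPDMG * unname(multiplier[CROPDMGEXP])) %>%
  group_by(EVTYPE) %>%
  summarize(total_dmg = mean(total_dmg))

toy_summary
```

By hand, the two FLOOD rows come to 25,000 and 1,510,000, for a mean of 767,500, and the one surviving HAIL row comes to 2,500. The fourth row has positive property damage but is excluded because its crop exponent is missing, which is the trade-off the filter makes.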
We can also prepare the data for the health question. For this question we will just look at the average number of fatalities by event type, which requires less processing than the damage figures. We keep two separate data sets because the economic data set above drops observations whose PROPDMGEXP or CROPDMGEXP is not a recognized code, while the data set below keeps every observation.
fatal_data <- data %>%
  group_by(EVTYPE) %>%
  summarize(fatalities = mean(FATALITIES)) %>%
  arrange(desc(fatalities))
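Because the labels are left as given, a label may cover only a handful of records, and a mean over very few events can rank high on the strength of a single record. One way to make that visible, sketched here on made-up data (the `toy` data frame and the `n_events` column are assumptions, not part of the analysis above), is to report the number of events next to each mean.

```r
library(dplyr)

# Made-up records for illustration only
toy <- data.frame(
  EVTYPE     = c("HEAT", "HEAT", "HEAT", "RARE LABEL"),
  FATALITIES = c(0, 1, 2, 9),
  stringsAsFactors = FALSE
)

toy_fatal <- toy %>%
  group_by(EVTYPE) %>%
  summarize(fatalities = mean(FATALITIES),  # average fatalities per event
            n_events   = n()) %>%           # how many events the average is based on
  arrange(desc(fatalities))

toy_fatal
```

Here the label with a single record ranks first, and the count column shows that its average rests on one event.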
Below we show the 10 event types with the highest average fatalities, first as a table and then as a bar chart. The three deadliest event types on average are (1) tornadoes, thunderstorm wind, and hail, with an average of 25 fatalities per event, (2) cold and snow, with an average of 14 fatalities per event, and (3) Tropical Storm Gordon, with an average of 8 fatalities per event.
head(fatal_data, n = 10)
## # A tibble: 10 x 2
##    EVTYPE                     fatalities
##    <chr>                           <dbl>
##  1 TORNADOES, TSTM WIND, HAIL      25.0
##  2 COLD AND SNOW                   14.0
##  3 TROPICAL STORM GORDON            8.00
##  4 RECORD/EXCESSIVE HEAT            5.67
##  5 EXTREME HEAT                     4.36
##  6 HEAT WAVE DROUGHT                4.00
##  7 HIGH WIND/SEAS                   4.00
##  8 MARINE MISHAP                    3.50
##  9 WINTER STORMS                    3.33
## 10 Heavy surf and wind              3.00
fatal_data %>%
  head(n = 10) %>%
  ggplot(aes(x = fct_reorder(EVTYPE, fatalities), y = fatalities)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Average Fatalities of Severe Weather Events") +
  ylab("Fatalities per Event") +
  xlab("Event Type")
Below we show the top 10 types of severe weather events in terms of average total economic damage per event, where total damage is the sum of property damage and crop damage. We show the same information in a bar chart.
head(econ_data, n = 10)
## # A tibble: 10 x 4
##    EVTYPE                       total_dmg    prop_dmg   crop_dmg
##    <chr>                            <dbl>       <dbl>      <dbl>
##  1 TORNADOES, TSTM WIND, HAIL 1602500000. 1600000000.   2500000.
##  2 HURRICANE/TYPHOON           889338418.  810311970.  79026448.
##  3 HURRICANE OPAL              729000000.  722666667.   6333333.
##  4 RIVER FLOOD                 631773062.  317477188. 314295875.
##  5 HURRICANE                   182430412.  142887618.  39542794.
##  6 HURRICANE OPAL/HIGH WINDS   110000000.  100000000.  10000000.
##  7 HURRICANE ERIN               87336667.   85333333.   2003333.
##  8 WINTER STORM HIGH WINDS      65000000.   60000000.   5000000.
##  9 River Flooding               44670000.   35330000.   9340000.
## 10 STORM SURGE/TIDE             34128625.   34122375.      6250.
econ_data %>%
  head(n = 10) %>%
  ggplot(aes(x = fct_reorder(EVTYPE, total_dmg), y = total_dmg / 1e9)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Average Economic Damage of Severe Weather Events") +
  ylab("Damage in USD (Billions)") +
  xlab("Event Type")