For this analysis, I examined the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which tracks characteristics of major storms and weather events. Specifically, I examined the events from 1950 to 2011 in order to understand which event types were most harmful to population health and to the economy. I found that tornadoes have caused the most total harm to population health (judged by the number of fatalities and injuries), and floods caused the most total harm to the economy (judged by property damage and crop damage). Although I focused on figuring out which event types caused the most total harm, I also analyzed which event types caused the most harm on a per-event basis. I did not take geography or changes over time into account in my analysis, which could form the basis for a useful follow-up study.
To begin, I downloaded the Storm Data file and saved it in my working directory. This file contains storm data from 1950 through 2011. More information about the data is available in the FAQ.
# read in the data and determine the dimensions
storm <- read.csv("repdata-data-StormData.csv.bz2", as.is = TRUE)
dim(storm)
## [1] 902297 37
Since this analysis is focused on event types, I checked the number of unique values in the event type field:
# determine the number of unique event types
events <- unique(storm$EVTYPE)
length(events)
## [1] 985
It contains 985 unique values. This was surprising, since the Storm Data Documentation listed only 48 event types.
Upon further exploration, I determined that many of those EVTYPE values were minor variations of the 48 “official” event types. For example:
# show event types containing 'FLASH'
events[grep("FLASH", events)]
## [1] "ICE STORM/FLASH FLOOD" "FLASH FLOOD"
## [3] "FLASH FLOODING" "FLASH FLOODING/THUNDERSTORM WI"
## [5] "FLASH FLOODS" "FLOOD/FLASH FLOOD"
## [7] "FLASH FLOOD WINDS" "FLASH FLOOD/"
## [9] "THUNDERSTORM WINDS/FLASH FLOOD" "LOCAL FLASH FLOOD"
## [11] "FLOOD/FLASH FLOODING" "FLASH FLOOD/FLOOD"
## [13] "FLASH FLOOD FROM ICE JAMS" "FLASH FLOOD - HEAVY RAIN"
## [15] "FLASH FLOOD/ STREET" "FLASH FLOOD/HEAVY RAIN"
## [17] "FLOOD FLASH" "FLOOD FLOOD/FLASH"
## [19] "FLOOD/FLASH" "FLASH FLOOD/ FLOOD"
## [21] "FLASH FLOODING/FLOOD" "FLASH FLOOODING"
## [23] "FLOOD/FLASHFLOOD" "FLASH FLOOD/LANDSLIDE"
## [25] "FLASH FLOOD LANDSLIDES" "FLOOD/FLASH/FLOOD"
## [27] " FLASH FLOOD"
Although I considered attempting to “fix” every unofficial event type by hand, I decided that this was unnecessary since my analysis is focused on event types with the largest impacts, and 817 of the 985 event types had less than 10 occurrences in the dataset:
# calculate the number of event types with less than 10 occurrences
suppressMessages(library(dplyr))
storm %>% select(EVTYPE) %>% group_by(EVTYPE) %>% summarise(count = n()) %>%
filter(count < 10) %>% nrow()
## [1] 817
Instead, I focused on combining duplicate event types in which there were at least 50 occurrences in the dataset:
# display the event types with at least 50 occurrences
event.topcounts <- storm %>% select(EVTYPE) %>% group_by(EVTYPE) %>% summarise(count = n()) %>%
filter(count >= 50) %>% data.frame()
event.topcounts
## EVTYPE count
## 1 ASTRONOMICAL HIGH TIDE 103
## 2 ASTRONOMICAL LOW TIDE 174
## 3 AVALANCHE 386
## 4 BLIZZARD 2719
## 5 COASTAL FLOOD 650
## 6 COASTAL FLOODING 143
## 7 COLD 72
## 8 COLD/WIND CHILL 539
## 9 DENSE FOG 1293
## 10 DROUGHT 2488
## 11 DRY MICROBURST 186
## 12 DUST DEVIL 141
## 13 DUST STORM 427
## 14 EXCESSIVE HEAT 1678
## 15 EXTREME COLD 655
## 16 EXTREME COLD/WIND CHILL 1002
## 17 EXTREME WINDCHILL 204
## 18 FLASH FLOOD 54277
## 19 FLASH FLOODING 682
## 20 FLOOD 25326
## 21 FLOOD/FLASH FLOOD 624
## 22 FLOODING 120
## 23 FOG 538
## 24 FREEZE 74
## 25 FREEZING RAIN 250
## 26 FROST 53
## 27 FROST/FREEZE 1342
## 28 FUNNEL CLOUD 6839
## 29 FUNNEL CLOUDS 87
## 30 GUSTY WINDS 53
## 31 HAIL 288661
## 32 HEAT 767
## 33 HEAT WAVE 74
## 34 HEAVY RAIN 11723
## 35 HEAVY SNOW 15708
## 36 HEAVY SURF 84
## 37 HEAVY SURF/HIGH SURF 228
## 38 HIGH SURF 725
## 39 HIGH WIND 20212
## 40 HIGH WINDS 1533
## 41 HURRICANE 174
## 42 HURRICANE/TYPHOON 88
## 43 ICE 61
## 44 ICE STORM 2006
## 45 LAKE-EFFECT SNOW 636
## 46 LANDSLIDE 600
## 47 LIGHT SNOW 154
## 48 LIGHTNING 15754
## 49 MARINE HAIL 442
## 50 MARINE HIGH WIND 135
## 51 MARINE THUNDERSTORM WIND 5812
## 52 MARINE TSTM WIND 6175
## 53 MODERATE SNOWFALL 101
## 54 RECORD COLD 64
## 55 RECORD HEAT 81
## 56 RECORD WARMTH 146
## 57 RIP CURRENT 470
## 58 RIP CURRENTS 304
## 59 RIVER FLOOD 173
## 60 SLEET 59
## 61 SNOW 587
## 62 STORM SURGE 261
## 63 STORM SURGE/TIDE 148
## 64 STRONG WIND 3566
## 65 STRONG WINDS 196
## 66 THUNDERSTORM WIND 82563
## 67 THUNDERSTORM WINDS 20843
## 68 THUNDERSTORM WINDS HAIL 61
## 69 THUNDERSTORM WINDSS 51
## 70 TORNADO 60652
## 71 TROPICAL DEPRESSION 60
## 72 TROPICAL STORM 690
## 73 TSTM WIND 219940
## 74 TSTM WIND/HAIL 1028
## 75 UNSEASONABLY DRY 56
## 76 UNSEASONABLY WARM 126
## 77 URBAN FLOOD 249
## 78 URBAN FLOODING 98
## 79 URBAN/SML STREAM FLD 3392
## 80 WATERSPOUT 3796
## 81 WILD/FOREST FIRE 1457
## 82 WILDFIRE 2761
## 83 WIND 340
## 84 WINTER STORM 11433
## 85 WINTER WEATHER 7026
## 86 WINTER WEATHER/MIX 1104
## 87 WINTRY MIX 90
I only combined clear duplicates, and generally used the name specified by the Storm Data documentation:
# fix duplicates
storm$EVTYPE[storm$EVTYPE == "COASTAL FLOODING"] <- "COASTAL FLOOD"
storm$EVTYPE[storm$EVTYPE == "COLD"] <- "COLD/WIND CHILL"
storm$EVTYPE[storm$EVTYPE == "EXTREME COLD"] <- "EXTREME COLD/WIND CHILL"
storm$EVTYPE[storm$EVTYPE == "EXTREME WINDCHILL"] <- "EXTREME COLD/WIND CHILL"
storm$EVTYPE[storm$EVTYPE == "FLASH FLOODING"] <- "FLASH FLOOD"
storm$EVTYPE[storm$EVTYPE == "FLOOD/FLASH FLOOD"] <- "FLASH FLOOD"
storm$EVTYPE[storm$EVTYPE == "FLOODING"] <- "FLOOD"
storm$EVTYPE[storm$EVTYPE == "URBAN FLOOD"] <- "FLOOD"
storm$EVTYPE[storm$EVTYPE == "URBAN FLOODING"] <- "FLOOD"
storm$EVTYPE[storm$EVTYPE == "URBAN/SML STREAM FLD"] <- "FLOOD"
storm$EVTYPE[storm$EVTYPE == "FREEZE"] <- "FROST/FREEZE"
storm$EVTYPE[storm$EVTYPE == "FROST"] <- "FROST/FREEZE"
storm$EVTYPE[storm$EVTYPE == "FUNNEL CLOUDS"] <- "FUNNEL CLOUD"
storm$EVTYPE[storm$EVTYPE == "HEAT WAVE"] <- "HEAT"
storm$EVTYPE[storm$EVTYPE == "HEAVY SURF"] <- "HIGH SURF"
storm$EVTYPE[storm$EVTYPE == "HEAVY SURF/HIGH SURF"] <- "HIGH SURF"
storm$EVTYPE[storm$EVTYPE == "HIGH WINDS"] <- "HIGH WIND"
storm$EVTYPE[storm$EVTYPE == "HURRICANE"] <- "HURRICANE/TYPHOON"
storm$EVTYPE[storm$EVTYPE == "TYPHOON"] <- "HURRICANE/TYPHOON"
storm$EVTYPE[storm$EVTYPE == "ICE"] <- "ICE STORM"
storm$EVTYPE[storm$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "RIP CURRENTS"] <- "RIP CURRENT"
storm$EVTYPE[storm$EVTYPE == "STORM SURGE/TIDE"] <- "STORM SURGE"
storm$EVTYPE[storm$EVTYPE == "STRONG WINDS"] <- "STRONG WIND"
storm$EVTYPE[storm$EVTYPE == "THUNDERSTORM WINDS"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "THUNDERSTORM WINDS HAIL"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "THUNDERSTORM WINDSS"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "TSTM WIND/HAIL"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "WILD/FOREST FIRE"] <- "WILDFIRE"
storm$EVTYPE[storm$EVTYPE == "WINTER WEATHER/MIX"] <- "WINTER WEATHER"
Next, I wanted to create a new variable for property damage amount (PROPDMGAMT) by multiplying the base figure (PROPDMG) by the magnitute (PROPDMGEXP). I also created a new variable for crop damage amount (CROPDMGAMT) by multiplying the base figure (CROPDMG) by the magnitute (CROPDMGEXP). Per page 12 of the documentation, “K” means thousands, “M” means millions, and “B” means billions.
However, the PROPDMGEXP and CROPDMGEXP variables do not always contain proper values:
table(storm$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(storm$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
I decided to treat everything other than K/k, M/m, and B/b as invalid. Thus, any record with an invalid magnitude would be assigned an amount of zero.
# copy data frame variables to vectors
PROPDMG <- storm$PROPDMG
PROPDMGEXP <- storm$PROPDMGEXP
CROPDMG <- storm$CROPDMG
CROPDMGEXP <- storm$CROPDMGEXP
# change lowercase k/m/b to uppercase K/M/B
PROPDMGEXP <- toupper(PROPDMGEXP)
CROPDMGEXP <- toupper(CROPDMGEXP)
# calculate PROPDMGAMT
PROPDMGAMT <- ifelse(PROPDMGEXP == "K", PROPDMG * 1000, ifelse(PROPDMGEXP ==
"M", PROPDMG * 1e+06, ifelse(PROPDMGEXP == "B", PROPDMG * 1e+09, 0)))
# calculate CROPDMGAMT
CROPDMGAMT <- ifelse(CROPDMGEXP == "K", CROPDMG * 1000, ifelse(CROPDMGEXP ==
"M", CROPDMG * 1e+06, ifelse(CROPDMGEXP == "B", CROPDMG * 1e+09, 0)))
# add new variables to data frame
storm$PROPDMGAMT <- PROPDMGAMT
storm$CROPDMGAMT <- CROPDMGAMT
Finally, I created a new variable called “HARMHEALTH” that combined the “FATALITIES” and “INJURIES” variables into a single number. The formula I decided to use was “HARMHEALTH = INJURIES + 20*FATALITIES”, meaning that fatalities were weighted 20 times as much as injuries.
# calculate HARMHEALTH for each event
storm$HARMHEALTH <- storm$INJURIES + 20 * storm$FATALITIES
The first question I had to answer was this: Across the United States, which types of events are most harmful with respect to population health? I answered this question by calculating the total “HARMHEALTH” score for every type of event, and then plotting the events with the top 10 scores:
# calculate the total HARMHEALTH for each event type and save it to a data
# frame
stormhealth <- storm %>% select(EVTYPE, HARMHEALTH) %>% group_by(EVTYPE) %>%
summarise(HARMHEALTHTOTAL = sum(HARMHEALTH), COUNT = n()) %>% arrange(desc(HARMHEALTHTOTAL)) %>%
data.frame()
# plot the top 10 event types
library(ggplot2)
ggplot(stormhealth[1:10, ]) + aes(x = reorder(EVTYPE, -HARMHEALTHTOTAL), y = HARMHEALTHTOTAL) +
geom_bar(stat = "identity") + labs(title = "Top 10 Storm Event Types Ranked by Total Harm to Human Health",
x = "Storm Event Type", y = "HARMHEALTH Score") + theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1))
As you can see, tornadoes have (by far) the highest impact on human health, followed by “excessive heat” and “heat”.
It is also interesting to see the average “harm per event” for these same 10 event types:
# calculate the average 'harm per event' for each event type
stormhealth$HARMPEREVENT <- round(stormhealth$HARMHEALTHTOTAL/stormhealth$COUNT,
2)
head(stormhealth, 10)
## EVTYPE HARMHEALTHTOTAL COUNT HARMPEREVENT
## 1 TORNADO 204006 60652 3.36
## 2 EXCESSIVE HEAT 44585 1678 26.57
## 3 HEAT 24589 841 29.24
## 4 THUNDERSTORM WIND 23572 324486 0.07
## 5 FLASH FLOOD 22080 55583 0.40
## 6 LIGHTNING 21550 15754 1.37
## 7 FLOOD 16950 29185 0.58
## 8 RIP CURRENT 11969 774 15.46
## 9 HIGH WIND 7099 21745 0.33
## 10 EXTREME COLD/WIND CHILL 6300 1861 3.39
On a per-event basis, you can see that heat and rip currents are the most dangerous out of these 10 event types.
Finally, let's take a look at which event types had the highest harm-per-event (limiting to event types with at least 50 occurrences):
# show the event types with the highest harm-per-event
stormhealth %>% filter(COUNT >= 50) %>% arrange(desc(HARMPEREVENT)) %>% head(10)
## EVTYPE HARMHEALTHTOTAL COUNT HARMPEREVENT
## 1 HEAT 24589 841 29.24
## 2 EXCESSIVE HEAT 44585 1678 26.57
## 3 RIP CURRENT 11969 774 15.46
## 4 HURRICANE/TYPHOON 3826 273 14.01
## 5 AVALANCHE 4650 386 12.05
## 6 COLD/WIND CHILL 2660 611 4.35
## 7 FOG 1974 538 3.67
## 8 EXTREME COLD/WIND CHILL 6300 1861 3.39
## 9 TORNADO 204006 60652 3.36
## 10 HIGH SURF 3240 1037 3.12
Whether a government official should focus on events with the highest total harm or events with the highest harm-per-event depends on many factors. Additional factors that the government official might take into consideration include geography and changes over time, neither of which were considered in my analysis.
The second question I had to answer was this: Across the United States, which types of events have the greatest economic consequences? I answered this question by calculating the total dollar amount of property and crop damages for every type of event, and then plotting the events with the top 10 damage totals:
# calculate the damage totals for each event type and save it to a data
# frame
stormecon <- storm %>% select(EVTYPE, PROPDMGAMT, CROPDMGAMT) %>% group_by(EVTYPE) %>%
summarise(Property = sum(PROPDMGAMT), Crop = sum(CROPDMGAMT), DMGTOTAL = Property +
Crop, COUNT = n()) %>% arrange(desc(DMGTOTAL)) %>% data.frame()
# save the top 10 event types to a new data frame
stormecon.top <- stormecon[1:10, ]
# reshape the data so that it can be plotted by damage type
library(reshape2)
stormecon.melt <- melt(stormecon.top, id = "EVTYPE", measure.vars = c("Property",
"Crop"))
stormecon.melt$EVTYPE <- factor(stormecon.melt$EVTYPE, levels = stormecon.top$EVTYPE)
# plot the top 10 event types
ggplot(stormecon.melt) + aes(x = EVTYPE, y = value/1e+09, fill = variable) +
geom_bar(stat = "identity") + labs(title = "Top 10 Storm Event Types Ranked by Total Damage to Property and Crops",
x = "Storm Event Type", y = "Total Damages (in billions of USD)", fill = "Type of Damage") +
theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1,
vjust = 1))
The types of the events with the greatest economic harm are floods, hurricanes/typhoons, tornadoes, and storm surges. As you can see, almost all of the damage totals are from property damage (rather than crop damage), with the exception of droughts.
It is also interesting to see the average “damage per event” for these same 10 event types:
# calculate the average 'damage per event' for each event type
stormecon$DMGPEREVENT <- round(stormecon$DMGTOTAL/stormecon$COUNT, 0)
head(stormecon, 10)
## EVTYPE Property Crop DMGTOTAL COUNT DMGPEREVENT
## 1 FLOOD 1.448e+11 5.681e+09 1.505e+11 29185 5157744
## 2 HURRICANE/TYPHOON 8.177e+10 5.351e+09 8.712e+10 273 319139182
## 3 TORNADO 5.694e+10 4.150e+08 5.735e+10 60652 945593
## 4 STORM SURGE 4.796e+10 8.550e+05 4.797e+10 409 117275254
## 5 HAIL 1.573e+10 3.026e+09 1.876e+10 288661 64984
## 6 FLASH FLOOD 1.662e+10 1.531e+09 1.815e+10 55583 326612
## 7 DROUGHT 1.046e+09 1.397e+10 1.502e+10 2488 6036444
## 8 THUNDERSTORM WIND 9.751e+09 1.224e+09 1.098e+10 324486 33823
## 9 RIVER FLOOD 5.119e+09 5.029e+09 1.015e+10 173 58661298
## 10 ICE STORM 3.958e+09 5.022e+09 8.980e+09 2067 4344314
On a per-event basis, you can see that hurricanes/typhoons and storm surges cause the most damage out of these 10 event types.
Finally, let's take a look at which event types had the highest damage-per-event (limiting to event types with at least 50 occurrences):
# show the event types with the highest damage-per-event
stormecon %>% filter(COUNT >= 50) %>% arrange(desc(DMGPEREVENT)) %>% head(10)
## EVTYPE Property Crop DMGTOTAL COUNT DMGPEREVENT
## 1 HURRICANE/TYPHOON 8.177e+10 5.351e+09 8.712e+10 273 319139182
## 2 STORM SURGE 4.796e+10 8.550e+05 4.797e+10 409 117275254
## 3 RIVER FLOOD 5.119e+09 5.029e+09 1.015e+10 173 58661298
## 4 TROPICAL STORM 7.704e+09 6.783e+08 8.382e+09 690 12148169
## 5 DROUGHT 1.046e+09 1.397e+10 1.502e+10 2488 6036444
## 6 FLOOD 1.448e+11 5.681e+09 1.505e+11 29185 5157744
## 7 ICE STORM 3.958e+09 5.022e+09 8.980e+09 2067 4344314
## 8 WILDFIRE 7.767e+09 4.023e+08 8.169e+09 4218 1936750
## 9 FROST/FREEZE 9.700e+06 1.606e+09 1.616e+09 1469 1100076
## 10 TORNADO 5.694e+10 4.150e+08 5.735e+10 60652 945593
As I mentioned in the section on Population Health: Whether a government official should focus on events with the highest total damage or events with the highest damage-per-event depends on many factors. Additional factors that the government official might take into consideration include geography and changes over time, neither of which were considered in my analysis.
< END >