In this analysis we attempt to answer the following questions using the NOAA storm dataset from 2007:
Q1. Across the United States, which types of events are most harmful with respect to population health?
Q2. Across the United States, which types of events have the greatest economic consequences?
To address these questions, we use combined injury and fatality counts as a proxy for “population health” and combined property and crop damage costs as a proxy for “economic consequences”. For each measure, we aggregate by type of storm event. Because the event type column in the dataset is messy and error-prone, we grouped the types into broader categories for legibility; details of those decisions are given in the next section. We found the answer to question 1 to be tornado-based events, with approximately 100,000 fatalities and injuries combined, the largest total by a wide margin. We found the answer to question 2 to be thunderstorm-based events, with total damage costing almost half a trillion dollars.
First we read in the raw csv file.
library(dplyr); library(ggplot2)  # filter()/mutate() and ggplot() are used below
df <- read.csv("./repdata_data_StormData/repdata_data_StormData.csv")
Because the analysis hinges on the event type, we start by taking a look at it.
result <- tapply(df$FATALITIES,df$EVTYPE,sum)
result <- as.data.frame(as.table(result))
result <- filter(result,Freq>0)
head(result)
## Var1 Freq
## 1 AVALANCE 1
## 2 AVALANCHE 224
## 3 BLACK ICE 1
## 4 BLIZZARD 101
## 5 blowing snow 1
## 6 BLOWING SNOW 1
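The `Var1` and `Freq` column names in this output come from `as.data.frame(as.table(...))`. A minimal sketch of the same aggregation on a made-up data frame shows each step. The `toy` rows are hypothetical, and base-R subsetting stands in for `filter()` so the example has no package dependencies.

```r
# Hypothetical rows standing in for the storm data.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "FOG"),
  FATALITIES = c(2, 5, 1, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type; the result is a named 1-d array.
totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum)

# as.table() + as.data.frame() yields one row per type,
# with the default column names Var1 (type) and Freq (sum).
totals_df <- as.data.frame(as.table(totals))

# Keep only the types with at least one fatality.
totals_df <- totals_df[totals_df$Freq > 0, ]
print(totals_df)
```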
We see that the EVTYPE labels with nonzero fatalities are messy (misspellings and inconsistent capitalization), so we clean them up a bit.
df$EVTYPE <- trimws(df$EVTYPE)
df$EVTYPE <- toupper(df$EVTYPE)
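To see what this normalization buys us, here is a small sketch on a made-up vector of labels. The `labels` vector is illustrative, modeled on the variants visible in the output above rather than taken from the full dataset. It applies the same two steps and counts distinct values before and after.

```r
# Hypothetical sample of raw labels, modeled on the variants seen above.
labels <- c("BLOWING SNOW", "blowing snow", " BLIZZARD", "BLIZZARD", "AVALANCE")

# Same two steps as in the analysis: strip surrounding whitespace, then upper-case.
cleaned <- toupper(trimws(labels))

n_before <- length(unique(labels))   # distinct labels before cleaning
n_after  <- length(unique(cleaned))  # distinct labels after cleaning
print(c(before = n_before, after = n_after))
```

Case and whitespace variants collapse, but a misspelling such as AVALANCE survives this step; it has to be handled by the manual grouping that follows.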
Now we try to consolidate some categories. To answer the question at hand, we add the fatality and injury columns together in a new column, HARMHEALTH, and keep only the rows where it is nonzero. In an effort not to lose any information, we create another new column, EVCAT, to classify entries that are similar.
df <- mutate(df, HARMHEALTH = FATALITIES + INJURIES)
df2 <- filter(df, HARMHEALTH > 0)
cold <- c("BLACK ICE","BLIZZARD","BLOWING SNOW","COLD","COLD AND SNOW",
"COLD TEMPERATURE","COLD WAVE","COLD WEATHER","COLD/WIND CHILL","COLD/WINDS",
"EXTENDED COLD","EXTREME COLD","EXTREME COLD/WIND CHILL","EXTREME WINDCHILL",
"LOW TEMPERATURE","RECORD COLD","FALLING SNOW/ICE","FOG AND COLD TEMPERATURES",
"FREEZE","FREEZING DRIZZLE","FREEZING RAIN/SNOW",
"FROST","HYPOTHERMIA","HYPOTHERMIA/EXPOSURE","ICE","ICE ON ROAD","ICE STORM",
"ICY ROADS","LIGHT SNOW","RAIN/SNOW","SLEET","SNOW","SNOW AND ICE","SNOW SQUALL",
"SNOW/ BITTER COLD","UNSEASONABLY COLD","WINTER STORM","WINTER STORMS",
"WINTER STORM HIGH WINDS","WINTER WEATHER","WINTER WEATHER/MIX",
"WINTRY MIX","WINTER WEATHER MIX","SNOW SQUALLS","ICE STORM/FLASH FLOOD","ICE ROADS",
"HEAVY SNOW/ICE","HEAVY SNOW SHOWER","HEAVY SNOW/BLIZZARD/AVALANCHE","HEAVY SNOW",
"HEAVY SNOW AND HIGH WINDS","GLAZE","GLAZE/ICE STORM","FREEZING SPRAY",
"FREEZING RAIN","EXCESSIVE SNOW","SNOW/HIGH WIND","HIGH WIND/HEAVY SNOW",
"HIGH WINDS/COLD","HIGH WINDS/SNOW","SNOW/HIGH WINDS","THUNDERSNOW")
flood <- c("COASTAL FLOOD","COASTAL FLOODING","EXCESSIVE RAINFALL","FLASH FLOOD",
"FLASH FLOOD/FLOOD","FLASH FLOODING","FLASH FLOODING/FLOOD","FLASH FLOODS",
"FLOOD","FLOOD & HEAVY RAIN","FLOOD/FLASH FLOOD","FLOOD/RIVER FLOOD","FLOODING",
"HEAVY RAINS","MINOR FLOODING","MIXED PRECIP","RAIN/WIND","RAPIDLY RISING WATER",
"RIVER FLOOD","RIVER FLOODING","TIDAL FLOODING","TORRENTIAL RAINFALL",
"URBAN AND SMALL STREAM FLOODIN","COASTAL FLOODING/EROSION","HEAVY RAIN","URBAN/SML STREAM FLD")
heat <- c("DROUGHT","DROUGHT/EXCESSIVE HEAT","EXCESSIVE HEAT","EXTREME HEAT",
"HEAT","HEAT WAVE","HEAT WAVE DROUGHT","HEAT WAVES","HYPERTHERMIA/EXPOSURE",
"RECORD HEAT","RECORD/EXCESSIVE HEAT","UNSEASONABLY WARM",
"UNSEASONABLY WARM AND DRY","WARM WEATHER")
av <- c("AVALANCHE","AVALANCE","LANDSLIDE","LANDSLIDES","MUDSLIDE","MUDSLIDES")
fires <- c("BRUSH FIRE", "WILD FIRES","WILD/FOREST FIRE","WILDFIRE")
surf <- c("COASTAL STORM","HAZARDOUS SURF","HEAVY SEAS","HEAVY SURF","HEAVY SURF AND WIND",
"HEAVY SURF/HIGH SURF","HIGH SEAS","HIGH SURF","HIGH SWELLS","HIGH WATER","HIGH WAVES",
"HIGH WIND AND SEAS","HIGH WIND/SEAS","MARINE ACCIDENT","MARINE HIGH WIND",
"MARINE MISHAP","MARINE STRONG WIND","MARINE THUNDERSTORM WIND","MARINE TSTM WIND",
"RIP CURRENT","RIP CURRENTS/HEAVY SURF","ROGUE WAVE","ROUGH SEAS","ROUGH SURF",
"STORM SURGE/TIDE","STORM SURGE","COASTALSTORM","RIP CURRENTS")
fog <- c("DENSE FOG","FOG")
wind <- c("DRY MICROBURST","DRY MIRCOBURST WINDS","DUST DEVIL","DUST STORM",
"FUNNEL CLOUD","GUSTY WIND","GUSTY WINDS","HIGH WIND","HIGH WIND 48","HIGH WINDS",
"NON-SEVERE WIND DAMAGE","NON TSTM WIND","STRONG WIND","STRONG WINDS","WIND",
"WHIRLWIND","WINDSTORM","WINDS","WIND STORM")
tstm <- c("HAIL","LIGHTNING","LIGHTNING AND THUNDERSTORM WIN","LIGHTNING INJURY",
"SMALL HAIL","THUNDERSTORM","THUNDERSTORM WINDS","THUNDERSTORM WIND","THUNDERSTORM WIND (G40)",
"THUNDERSTORM WIND G52","THUNDERSTORM WINDS","THUNDERSTORM WINDS 13","THUNDERSTORM WINDS/HAIL",
"THUNDERSTORM WINDSS","THUNDERSTORMS WINDS","THUNDERSTORMW","THUNDERTORM WINDS","TSTM WIND",
"TSTM WIND (G35)","TSTM WIND (G40)","TSTM WIND (G45)","TSTM WIND/HAIL","LIGHTNING.")
hurr <- c("HURRICANE","HURRICANE-GENERATED SWELLS","HURRICANE EDOUARD",
"HURRICANE EMILY","HURRICANE ERIN","HURRICANE FELIX","HURRICANE OPAL",
"HURRICANE OPAL/HIGH WINDS","HURRICANE/TYPHOON","TROPICAL STORM",
"TROPICAL STORM GORDON","TYPHOON","WATERSPOUT","WATERSPOUT TORNADO","WATERSPOUT/TORNADO")
torr <- c("TORNADO","TORNADO F2","TORNADO F3","TORNADOES, TSTM WIND, HAIL")
for (i in 1:nrow(df2)) {
if (df2$EVTYPE[i] %in% cold) {df2$EVCAT[i] <- "COLD/ICE/SNOW"}
else if (df2$EVTYPE[i] %in% flood) {df2$EVCAT[i] <- "FLOOD/RAIN"}
else if (df2$EVTYPE[i] %in% heat) {df2$EVCAT[i] <- "HEAT/DROUGHT"}
else if (df2$EVTYPE[i] %in% av) {df2$EVCAT[i] <- "AVALANCHE/LANDSLIDE"}
else if (df2$EVTYPE[i] %in% fires) {df2$EVCAT[i] <- "FIRES"}
else if (df2$EVTYPE[i] %in% surf) {df2$EVCAT[i] <- "SURF/MARINE"}
else if (df2$EVTYPE[i] %in% fog) {df2$EVCAT[i] <- "FOG"}
else if (df2$EVTYPE[i] %in% wind) {df2$EVCAT[i] <- "WIND"}
else if (df2$EVTYPE[i] %in% tstm) {df2$EVCAT[i] <- "TSTM/LIGHTNING/HAIL"}
else if (df2$EVTYPE[i] == "TSUNAMI") {df2$EVCAT[i] <- "TSUNAMI"}
else if (df2$EVTYPE[i] %in% hurr) {df2$EVCAT[i] <- "HURRICANE/TYPHOON"}
else if (df2$EVTYPE[i] %in% torr) {df2$EVCAT[i] <- "TORNADO"}
else {df2$EVCAT[i] <- "OTHER"}
}
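The row-by-row loop above can be slow on a large data frame. As a rough alternative, the same idea can be sketched with a named lookup vector that is indexed once for all rows. The `*_toy` vectors, `lookup`, and `toy` are hypothetical stand-ins for the long lists above, not the author's code.

```r
# Toy stand-ins for the category vectors; the real ones are the long lists above.
cold_toy  <- c("BLIZZARD", "ICE STORM")
flood_toy <- c("FLASH FLOOD", "FLOOD")
torr_toy  <- c("TORNADO")

# Named lookup: names are event types, values are category labels.
lookup <- c(
  setNames(rep("COLD/ICE/SNOW", length(cold_toy)), cold_toy),
  setNames(rep("FLOOD/RAIN", length(flood_toy)), flood_toy),
  setNames(rep("TORNADO", length(torr_toy)), torr_toy)
)

toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "BLIZZARD", "VOLCANIC ASH"),
                  stringsAsFactors = FALSE)

# Index the lookup by event type; types with no entry come back as NA.
toy$EVCAT <- unname(lookup[toy$EVTYPE])
toy$EVCAT[is.na(toy$EVCAT)] <- "OTHER"
print(toy)
```

If an event type appeared in more than one list, name-based indexing would return the first entry in `lookup`, which plays the same role as the first matching branch in the loop.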
We have now categorized the harm-based data for downstream analysis. Next we do the same for economic damage, combining property damage and crop damage into a single statistic. To do this, we need to express all dollar amounts in the same unit, which in this analysis is billions of dollars. Afterwards, we categorize as before.
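As a rough illustration of the unit conversion described here, the sketch below rescales property and crop damage independently, so a row where both columns carry an exponent code has both converted. It is an assumption-laden variant, not the author's loop: the `toy` rows, the `to_billions` table, and `scale_damage()` are hypothetical, and treating "B" as "already in billions" is an assumption not stated in the text.

```r
# Hypothetical rows. "K" = thousands and "M" = millions as in this analysis;
# "B" (assumed here) marks values already in billions.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1),
  PROPDMGEXP = c("K", "M", "B"),
  CROPDMG    = c(500, 0, 10),
  CROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Divisor that turns a value with the given exponent code into billions of dollars.
# Codes not listed here produce NA, so they stand out instead of being kept as-is.
to_billions <- c(K = 1e6, M = 1e3, B = 1)

scale_damage <- function(value, exp_code) {
  divisor <- unname(to_billions[toupper(exp_code)])
  # A zero amount stays zero regardless of its (possibly blank) exponent code.
  ifelse(value == 0, 0, value / divisor)
}

toy$PROP_B  <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
toy$CROP_B  <- scale_damage(toy$CROPDMG, toy$CROPDMGEXP)
toy$ECONDMG <- toy$PROP_B + toy$CROP_B
print(toy)
```

Because each column is handled on its own, this sketch can give different totals from a per-row else-if chain, which applies only the first matching branch to each row.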
df3 <- filter(df, PROPDMG > 0 | CROPDMG > 0)
cold <- c(cold,df3$EVTYPE[grep("ice|freez|frozen|frost|snow|blizz|winter|chill|UNSEASONABLE COLD",df3$EVTYPE,ignore.case = TRUE)])
flood <- c(flood,df3$EVTYPE[grep("flood|wet|rain|precipitation|shower",df3$EVTYPE,ignore.case = TRUE)])
heat <- c(heat,df3$EVTYPE[grep("heat",df3$EVTYPE,ignore.case = TRUE)])
av <- c(av,df3$EVTYPE[grep("slide|slump",df3$EVTYPE,ignore.case = TRUE)])
fires <- c(fires,df3$EVTYPE[grep("fire|smoke",df3$EVTYPE,ignore.case = TRUE)])
surf <- c(surf,df3$EVTYPE[grep("surf|coast|tide|beach|wave|swell|seiche",df3$EVTYPE,ignore.case = TRUE)])
tstm <- c(tstm,df3$EVTYPE[grep("thunderstorm|tstm|hail|light|LIGNTNING|THUD|thund|tund",df3$EVTYPE,ignore.case = TRUE)])
hurr <- c(hurr,df3$EVTYPE[grep("hurricane|typhoon|tropical|waterspout",df3$EVTYPE,ignore.case = TRUE)])
torr <- c(torr,df3$EVTYPE[grep("tornado|landspout|torn",df3$EVTYPE,ignore.case = TRUE)])
wind <- c(wind,df3$EVTYPE[grep("gust|blow|wind|downburst|microburst|turbulence",df3$EVTYPE,ignore.case = TRUE)])
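# Rescale damage amounts toward billions of dollars:
# "K" (thousands) is divided by 1e6 and "M" (millions) by 1e3.
# The branches form a single else-if chain, so at most one of them runs per row;
# rows with any other exponent code are left unchanged.
for (i in 1:nrow(df3)) {
  if (df3$PROPDMGEXP[i] == "K") {df3$PROPDMG[i] <- df3$PROPDMG[i]/1000000}
  else if (df3$CROPDMGEXP[i] == "K") {df3$CROPDMG[i] <- df3$CROPDMG[i]/1000000}
  else if (df3$PROPDMGEXP[i] == "M") {df3$PROPDMG[i] <- df3$PROPDMG[i]/1000}
  else if (df3$CROPDMGEXP[i] == "M") {df3$CROPDMG[i] <- df3$CROPDMG[i]/1000}
}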
df3 <- mutate(df3,ECONDMG=PROPDMG+CROPDMG)
for (i in 1:nrow(df3)) {
if (df3$EVTYPE[i] %in% cold) {df3$EVCAT[i] <- "COLD/ICE/SNOW"}
else if (df3$EVTYPE[i] %in% flood) {df3$EVCAT[i] <- "FLOOD/RAIN"}
else if (df3$EVTYPE[i] %in% heat) {df3$EVCAT[i] <- "HEAT/DROUGHT"}
else if (df3$EVTYPE[i] %in% av) {df3$EVCAT[i] <- "AVALANCHE/LANDSLIDE"}
else if (df3$EVTYPE[i] %in% fires) {df3$EVCAT[i] <- "FIRES"}
else if (df3$EVTYPE[i] %in% surf) {df3$EVCAT[i] <- "SURF/MARINE"}
else if (df3$EVTYPE[i] %in% wind) {df3$EVCAT[i] <- "WIND"}
else if (df3$EVTYPE[i] %in% tstm) {df3$EVCAT[i] <- "TSTM/LIGHTNING/HAIL"}
else if (df3$EVTYPE[i] == "TSUNAMI") {df3$EVCAT[i] <- "TSUNAMI"}
else if (df3$EVTYPE[i] %in% hurr) {df3$EVCAT[i] <- "HURRICANE/TYPHOON"}
else if (df3$EVTYPE[i] %in% torr) {df3$EVCAT[i] <- "TORNADO"}
else {df3$EVCAT[i] <- "OTHER"}
}
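Since the pattern-based lists can overlap (a label may match more than one regular expression), the order of the branches decides the category. The sketch below illustrates that with a small ordered rule table; `rules`, `classify()`, and the sample labels are hypothetical and use shortened patterns, not the full ones from the analysis.

```r
# Ordered rules: earlier entries take precedence, like earlier branches of the if/else chain.
rules <- c(
  "WIND"                = "gust|wind|microburst",
  "TSTM/LIGHTNING/HAIL" = "thunderstorm|tstm|hail",
  "TORNADO"             = "tornado"
)

classify <- function(evtype, rules) {
  out <- rep("OTHER", length(evtype))
  # Walk the rules in reverse so that earlier rules overwrite later ones.
  for (k in rev(seq_along(rules))) {
    hit <- grepl(rules[[k]], evtype, ignore.case = TRUE)
    out[hit] <- names(rules)[k]
  }
  out
}

evtype <- c("THUNDERSTORM WIND", "HAIL", "TORNADO F2", "HIGH WIND", "VOLCANIC ASH")
print(data.frame(evtype, category = classify(evtype, rules)))
```

In this toy example THUNDERSTORM WIND matches both the wind and the thunderstorm pattern and lands in WIND because that rule comes first. Reordering the rules would move it, so the branch order is part of the categorization decision.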
Now that the data is categorized, we plot the results to draw some conclusions. With regard to Q1:
result <- tapply(df2$HARMHEALTH,df2$EVCAT,sum)
result <- as.data.frame(as.table(result))
ggplot(result,aes(x=Var1,y=Freq))+geom_bar(stat="identity")+
labs(title="US Injuries/Fatalities From Storm Events",x="Events",y="Fatal/Injury Count")+
theme(axis.text.x=element_text(angle=45,hjust=1))
## Category Fatalities/Injuries
## 1 TORNADO 97022
## 2 TSTM/LIGHTNING/HAIL 17654
## 3 HEAT/DROUGHT 12426
## 4 FLOOD/RAIN 10644
## 5 COLD/ICE/SNOW 7846
## 6 WIND 2815
## 7 HURRICANE/TYPHOON 1995
## 8 SURF/MARINE 1755
## 9 FIRES 1698
## 10 FOG 1156
## 11 AVALANCHE/LANDSLIDE 494
## 12 TSUNAMI 162
## 13 OTHER 6
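The table above is listed from largest to smallest. One way to produce that ordering, sketched here on four totals copied from the table, is to sort with `order()`; the `share_of_top` column is an illustrative addition that expresses each total relative to the largest.

```r
# A few totals copied from the table above, deliberately out of order.
result <- data.frame(
  Var1 = c("FLOOD/RAIN", "TORNADO", "HEAT/DROUGHT", "TSTM/LIGHTNING/HAIL"),
  Freq = c(10644, 97022, 12426, 17654),
  stringsAsFactors = FALSE
)

# Sort rows from largest to smallest total.
ranked <- result[order(result$Freq, decreasing = TRUE), ]
rownames(ranked) <- NULL

# Ratio of each category to the largest one, rounded to three decimals.
ranked$share_of_top <- round(ranked$Freq / ranked$Freq[1], 3)
print(ranked)
```

In the plot itself, mapping `x = reorder(Var1, -Freq)` inside `aes()` would draw the bars in the same descending order.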
We see from the plot that tornado-based events are by far the largest source of fatalities and injuries in the US, with almost 100,000 people affected (97,022). Conceptually, this makes sense: tornadoes are hard to plan for because they can be sudden and extremely violent. A distant second is thunderstorm-based events with 17,654, less than a fifth of the tornado total.
With regard to Q2:
result2 <- tapply(df3$ECONDMG,df3$EVCAT,sum)
result2 <- as.data.frame(as.table(result2))
ggplot(result2,aes(x=Var1,y=Freq))+geom_bar(stat="identity")+
labs(title="US Economic Damage (Billions) From Storm Events",x="Events",y="Dollars (Billions)")+
theme(axis.text.x=element_text(angle=45,hjust=1))
## Category Dollars (Billion)
## 1 TSTM/LIGHTNING/HAIL 4.114682e+05
## 2 FLOOD/RAIN 2.866019e+05
## 3 WIND 1.869108e+05
## 4 TORNADO 7.525974e+04
## 5 HEAT/DROUGHT 1.672675e+04
## 6 COLD/ICE/SNOW 1.427534e+04
## 7 HURRICANE/TYPHOON 1.307078e+04
## 8 FIRES 6.968419e+03
## 9 SURF/MARINE 1.817648e+03
## 10 AVALANCHE/LANDSLIDE 1.613370e+02
## 11 TSUNAMI 1.433008e+02
## 12 OTHER 2.674645e-02
We see from the plot that thunderstorm-based events, including lightning and hail, have the highest economic cost, totaling almost half a trillion dollars (.4 trillion). This makes sense intuitively because these events can have catastrophic effects on both crops and property, which is how we are measuring economic consequences. In second place, with about 70% as much, is flood/rain at roughly .3 trillion dollars. Again, this seems a reasonable outcome given our cost categories of property and crops.