Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. In this report we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to understand what types of weather events have had the highest impact during the observed period of 1950 to November 2011.
From these data we found that tornados have been the most harmful to population health, both in terms of injuries and fatalities, while floods have had the most economic consequences.
We download and read the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Let’s see the dimensions of the data we’re dealing with:
library(data.table)
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/repdata-data-StormData.csv.bz2")) download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "data/repdata-data-StormData.csv.bz2")
storm.data <- fread("data/repdata-data-StormData.csv.bz2")
dim(storm.data)
## [1] 902297 37
Let’s have a look at the columns with the information we’re interested in:
dt <- storm.data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
head(dt)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1: TORNADO 0 15 25.0 K 0
## 2: TORNADO 0 0 2.5 K 0
## 3: TORNADO 0 2 25.0 K 0
## 4: TORNADO 0 2 2.5 K 0
## 5: TORNADO 0 2 2.5 K 0
## 6: TORNADO 0 6 2.5 K 0
Let’s look at damage magnitude (*DMGEXP) columns:
table(dt$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(dt$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
What do these numbers mean? There aren’t many occurencies so let’s ignore everything that’s not H (hundreds), K (thousands), M (Millions) or B (Billions).
Let’s convert damage values to the same unit:
dt[toupper(PROPDMGEXP) == "H", Property.Damage.Usd:= PROPDMG * 100]
dt[toupper(PROPDMGEXP) == "K", Property.Damage.Usd:= PROPDMG * 1000]
dt[toupper(PROPDMGEXP) == "M", Property.Damage.Usd:= PROPDMG * 1000000]
dt[toupper(PROPDMGEXP) == "B", Property.Damage.Usd:= PROPDMG * 1000000000]
dt[toupper(CROPDMGEXP) == "H", Crop.Damage.Usd:= CROPDMG * 100]
dt[toupper(CROPDMGEXP) == "K", Crop.Damage.Usd:= CROPDMG * 1000]
dt[toupper(CROPDMGEXP) == "M", Crop.Damage.Usd:= CROPDMG * 1000000]
dt[toupper(CROPDMGEXP) == "B", Crop.Damage.Usd:= CROPDMG * 1000000000]
Let’s see what kinds of events we have. Are they nicelly formatted?
length(unique(dt$EVTYPE))
## [1] 985
head(sort(table(dt$EVTYPE), decreasing = T), 20)
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
There’s almost 1k different event types. We can see there’s mixed cases, special characters, and a lot of variation in spelling (TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS).
We’re only interested in finding out the types of event that have had the highest impact, not in the precise numbers. For now let’s just upper case them and remove some special characters. Later on we’ll find out the ones with the biggest impact and fix them if needed.
dt$Event.Type <- as.factor(gsub("[^A-Z &/]+", "", toupper(dt$EVTYPE)))
length(unique(dt$Event.Type))
## [1] 775
Let’s see the 10 events that caused the most injuries:
calc.injuries.by.event <- function(dt) {
total.injuries <- sum(dt$INJURIES)
result <- dt[, .(Total=sum(INJURIES)), by = Event.Type]
result[,Pct.Injuries:= Total / total.injuries * 100]
result <- result[order(result$Pct.Injuries, decreasing = T)]
}
injuries.by.event <- calc.injuries.by.event(dt)
head(injuries.by.event, 10)
## Event.Type Total Pct.Injuries
## 1: TORNADO 91346 65.0019925
## 2: TSTM WIND 6957 4.9506148
## 3: FLOOD 6789 4.8310657
## 4: EXCESSIVE HEAT 6525 4.6432028
## 5: LIGHTNING 5230 3.7216782
## 6: HEAT 2100 1.4943641
## 7: ICE STORM 1975 1.4054139
## 8: FLASH FLOOD 1777 1.2645167
## 9: THUNDERSTORM WIND 1488 1.0588637
## 10: HAIL 1361 0.9684903
Let’s merge TSTM WIND and THUNDERSTORM WIND:
dt$Event.Type <- gsub("TSTM", "THUNDERSTORM", dt$Event.Type)
injuries.by.event <- calc.injuries.by.event(dt)
head(injuries.by.event, 10)
## Event.Type Total Pct.Injuries
## 1: TORNADO 91346 65.0019925
## 2: THUNDERSTORM WIND 8445 6.0094785
## 3: FLOOD 6789 4.8310657
## 4: EXCESSIVE HEAT 6525 4.6432028
## 5: LIGHTNING 5230 3.7216782
## 6: HEAT 2100 1.4943641
## 7: ICE STORM 1975 1.4054139
## 8: FLASH FLOOD 1777 1.2645167
## 9: HAIL 1361 0.9684903
## 10: WINTER STORM 1321 0.9400262
Now let’s look at top 10 fatalities causes:
calc.fatalities.by.event <- function(dt) {
total.fatalities <- sum(dt$FATALITIES)
result <- dt[, .(Total=sum(FATALITIES)), by = Event.Type]
result[,Pct.Fatalities:= Total / total.fatalities * 100]
result <- result[order(result$Pct.Fatalities, decreasing = T)]
}
fatalities.by.event <- calc.fatalities.by.event(dt)
head(fatalities.by.event, 10)
## Event.Type Total Pct.Fatalities
## 1: TORNADO 5633 37.193793
## 2: EXCESSIVE HEAT 1903 12.565203
## 3: FLASH FLOOD 978 6.457577
## 4: HEAT 937 6.186860
## 5: LIGHTNING 817 5.394520
## 6: THUNDERSTORM WIND 637 4.206009
## 7: FLOOD 470 3.103334
## 8: RIP CURRENT 368 2.429845
## 9: HIGH WIND 248 1.637504
## 10: AVALANCHE 224 1.479036
There are no duplicate events. We’ll leave it like this.
Let’s see economic consequences:
dt[(Property.Damage.Usd > 0 | Crop.Damage.Usd > 0), Economic.Damage.Usd:= ifelse(is.na(Property.Damage.Usd), 0, Property.Damage.Usd) + ifelse(is.na(Crop.Damage.Usd), 0, Crop.Damage.Usd)]
calc.economic.damage.by.event <- function(dt) {
total.economic.damage <- sum(dt$Economic.Damage.Usd, na.rm = T)
result <- dt[, .(Total=sum(Economic.Damage.Usd, na.rm = T)), by = Event.Type]
result[,Pct.Damage:= Total / total.economic.damage * 100]
result <- result[order(result$Pct.Damage, decreasing = T)]
}
economic.damage.by.event <- calc.economic.damage.by.event(dt)
head(economic.damage.by.event, 10)
## Event.Type Total Pct.Damage
## 1: FLOOD 150319678250 31.551737
## 2: HURRICANE/TYPHOON 71913712800 15.094514
## 3: TORNADO 57352113590 12.038070
## 4: STORM SURGE 43323541000 9.093506
## 5: HAIL 18758221670 3.937305
## 6: FLASH FLOOD 17562128610 3.686248
## 7: DROUGHT 15018672000 3.152383
## 8: HURRICANE 14610229010 3.066652
## 9: RIVER FLOOD 10148404500 2.130126
## 10: ICE STORM 8967041310 1.882160
Let’s merge HURRICANE/TYPHOON and HURRICANE
dt$Event.Type <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", dt$Event.Type)
economic.damage.by.event <- calc.economic.damage.by.event(dt)
head(economic.damage.by.event, 10)
## Event.Type Total Pct.Damage
## 1: FLOOD 150319678250 31.551737
## 2: HURRICANE/TYPHOON 86523941810 18.161166
## 3: TORNADO 57352113590 12.038070
## 4: STORM SURGE 43323541000 9.093506
## 5: HAIL 18758221670 3.937305
## 6: FLASH FLOOD 17562128610 3.686248
## 7: DROUGHT 15018672000 3.152383
## 8: RIVER FLOOD 10148404500 2.130126
## 9: ICE STORM 8967041310 1.882160
## 10: THUNDERSTORM WIND 8936932980 1.875841
We found that tornados have been the most harmful events to population health, both in terms of injuries and in terms of fatalities
par(mar=c(4,4,2,0), mfrow = c(2, 1), oma = c(0, 0, 2, 0))
top.injuries.by.event <- injuries.by.event[1:5,]
barplot(top.injuries.by.event$Total / 1000, names.arg = top.injuries.by.event$Event.Type, cex.names=0.7)
title(ylab = "# of Injuries (k)")
top.fatalities.by.event <- fatalities.by.event[1:5,]
barplot(top.fatalities.by.event$Total / 1000, names.arg = top.fatalities.by.event$Event.Type, cex.names=0.7)
title(ylab = "# of Fatalities (k)")
mtext("Most harmful events to population health (1950-2011)", outer = TRUE, cex = 1.5)
We found that floods have had the greatest economic consequences
top.economic.damage.by.event <- economic.damage.by.event[1:5,]
barplot(top.economic.damage.by.event$Total / 1000000000, names.arg = top.economic.damage.by.event$Event.Type, cex.names=0.7)
title(main = "Events with greatest economic consequences (1950-2011)", ylab = "USD (Billions)")