Synopsis:

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. In this report we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to understand what types of weather events have had the highest impact during the observed period of 1950 to November 2011.

From these data we found that tornados have been the most harmful to population health, both in terms of injuries and fatalities, while floods have had the most economic consequences.

Data Processing

We download and read the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Let’s see the dimensions of the data we’re dealing with:

library(data.table)
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/repdata-data-StormData.csv.bz2")) download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "data/repdata-data-StormData.csv.bz2")
storm.data <- fread("data/repdata-data-StormData.csv.bz2")
dim(storm.data)
## [1] 902297     37

Let’s have a look at the columns with the information we’re interested in:

dt <- storm.data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
head(dt)
##     EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1: TORNADO          0       15    25.0          K       0           
## 2: TORNADO          0        0     2.5          K       0           
## 3: TORNADO          0        2    25.0          K       0           
## 4: TORNADO          0        2     2.5          K       0           
## 5: TORNADO          0        2     2.5          K       0           
## 6: TORNADO          0        6     2.5          K       0

Let’s look at damage magnitude (*DMGEXP) columns:

table(dt$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(dt$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

What do these numbers mean? There aren’t many occurencies so let’s ignore everything that’s not H (hundreds), K (thousands), M (Millions) or B (Billions).

Let’s convert damage values to the same unit:

dt[toupper(PROPDMGEXP) == "H", Property.Damage.Usd:= PROPDMG * 100]
dt[toupper(PROPDMGEXP) == "K", Property.Damage.Usd:= PROPDMG * 1000]
dt[toupper(PROPDMGEXP) == "M", Property.Damage.Usd:= PROPDMG * 1000000]
dt[toupper(PROPDMGEXP) == "B", Property.Damage.Usd:= PROPDMG * 1000000000]

dt[toupper(CROPDMGEXP) == "H", Crop.Damage.Usd:= CROPDMG * 100]
dt[toupper(CROPDMGEXP) == "K", Crop.Damage.Usd:= CROPDMG * 1000]
dt[toupper(CROPDMGEXP) == "M", Crop.Damage.Usd:= CROPDMG * 1000000]
dt[toupper(CROPDMGEXP) == "B", Crop.Damage.Usd:= CROPDMG * 1000000000]

Let’s see what kinds of events we have. Are they nicelly formatted?

length(unique(dt$EVTYPE))
## [1] 985
head(sort(table(dt$EVTYPE), decreasing = T), 20)
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE 
##                     3392                     2761

There’s almost 1k different event types. We can see there’s mixed cases, special characters, and a lot of variation in spelling (TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS).

We’re only interested in finding out the types of event that have had the highest impact, not in the precise numbers. For now let’s just upper case them and remove some special characters. Later on we’ll find out the ones with the biggest impact and fix them if needed.

dt$Event.Type <- as.factor(gsub("[^A-Z &/]+", "", toupper(dt$EVTYPE)))
length(unique(dt$Event.Type))
## [1] 775

Let’s see the 10 events that caused the most injuries:

calc.injuries.by.event <- function(dt) {
  total.injuries <- sum(dt$INJURIES)
  result <- dt[, .(Total=sum(INJURIES)), by = Event.Type]
  result[,Pct.Injuries:= Total / total.injuries * 100]
  result <- result[order(result$Pct.Injuries, decreasing = T)]
}

injuries.by.event <- calc.injuries.by.event(dt)
head(injuries.by.event, 10)
##            Event.Type Total Pct.Injuries
##  1:           TORNADO 91346   65.0019925
##  2:         TSTM WIND  6957    4.9506148
##  3:             FLOOD  6789    4.8310657
##  4:    EXCESSIVE HEAT  6525    4.6432028
##  5:         LIGHTNING  5230    3.7216782
##  6:              HEAT  2100    1.4943641
##  7:         ICE STORM  1975    1.4054139
##  8:       FLASH FLOOD  1777    1.2645167
##  9: THUNDERSTORM WIND  1488    1.0588637
## 10:              HAIL  1361    0.9684903

Let’s merge TSTM WIND and THUNDERSTORM WIND:

dt$Event.Type <- gsub("TSTM", "THUNDERSTORM", dt$Event.Type)
injuries.by.event <- calc.injuries.by.event(dt)
head(injuries.by.event, 10)
##            Event.Type Total Pct.Injuries
##  1:           TORNADO 91346   65.0019925
##  2: THUNDERSTORM WIND  8445    6.0094785
##  3:             FLOOD  6789    4.8310657
##  4:    EXCESSIVE HEAT  6525    4.6432028
##  5:         LIGHTNING  5230    3.7216782
##  6:              HEAT  2100    1.4943641
##  7:         ICE STORM  1975    1.4054139
##  8:       FLASH FLOOD  1777    1.2645167
##  9:              HAIL  1361    0.9684903
## 10:      WINTER STORM  1321    0.9400262

Now let’s look at top 10 fatalities causes:

calc.fatalities.by.event <- function(dt) {
  total.fatalities <- sum(dt$FATALITIES)
  result <- dt[, .(Total=sum(FATALITIES)), by = Event.Type]
  result[,Pct.Fatalities:= Total / total.fatalities * 100]
  result <- result[order(result$Pct.Fatalities, decreasing = T)]
}

fatalities.by.event <- calc.fatalities.by.event(dt)
head(fatalities.by.event, 10)
##            Event.Type Total Pct.Fatalities
##  1:           TORNADO  5633      37.193793
##  2:    EXCESSIVE HEAT  1903      12.565203
##  3:       FLASH FLOOD   978       6.457577
##  4:              HEAT   937       6.186860
##  5:         LIGHTNING   817       5.394520
##  6: THUNDERSTORM WIND   637       4.206009
##  7:             FLOOD   470       3.103334
##  8:       RIP CURRENT   368       2.429845
##  9:         HIGH WIND   248       1.637504
## 10:         AVALANCHE   224       1.479036

There are no duplicate events. We’ll leave it like this.

Let’s see economic consequences:

dt[(Property.Damage.Usd > 0 | Crop.Damage.Usd > 0), Economic.Damage.Usd:= ifelse(is.na(Property.Damage.Usd), 0, Property.Damage.Usd) + ifelse(is.na(Crop.Damage.Usd), 0, Crop.Damage.Usd)]

calc.economic.damage.by.event <- function(dt) {
  total.economic.damage <- sum(dt$Economic.Damage.Usd, na.rm = T)
  result <- dt[, .(Total=sum(Economic.Damage.Usd, na.rm = T)), by = Event.Type]
  result[,Pct.Damage:= Total / total.economic.damage * 100]
  result <- result[order(result$Pct.Damage, decreasing = T)]
}
economic.damage.by.event <- calc.economic.damage.by.event(dt)
head(economic.damage.by.event, 10)
##            Event.Type        Total Pct.Damage
##  1:             FLOOD 150319678250  31.551737
##  2: HURRICANE/TYPHOON  71913712800  15.094514
##  3:           TORNADO  57352113590  12.038070
##  4:       STORM SURGE  43323541000   9.093506
##  5:              HAIL  18758221670   3.937305
##  6:       FLASH FLOOD  17562128610   3.686248
##  7:           DROUGHT  15018672000   3.152383
##  8:         HURRICANE  14610229010   3.066652
##  9:       RIVER FLOOD  10148404500   2.130126
## 10:         ICE STORM   8967041310   1.882160

Let’s merge HURRICANE/TYPHOON and HURRICANE

dt$Event.Type <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", dt$Event.Type)
economic.damage.by.event <- calc.economic.damage.by.event(dt)
head(economic.damage.by.event, 10)
##            Event.Type        Total Pct.Damage
##  1:             FLOOD 150319678250  31.551737
##  2: HURRICANE/TYPHOON  86523941810  18.161166
##  3:           TORNADO  57352113590  12.038070
##  4:       STORM SURGE  43323541000   9.093506
##  5:              HAIL  18758221670   3.937305
##  6:       FLASH FLOOD  17562128610   3.686248
##  7:           DROUGHT  15018672000   3.152383
##  8:       RIVER FLOOD  10148404500   2.130126
##  9:         ICE STORM   8967041310   1.882160
## 10: THUNDERSTORM WIND   8936932980   1.875841

Results

We found that tornados have been the most harmful events to population health, both in terms of injuries and in terms of fatalities

par(mar=c(4,4,2,0), mfrow = c(2, 1), oma = c(0, 0, 2, 0))

top.injuries.by.event <- injuries.by.event[1:5,]
barplot(top.injuries.by.event$Total / 1000, names.arg = top.injuries.by.event$Event.Type, cex.names=0.7)
title(ylab = "# of Injuries (k)")

top.fatalities.by.event <- fatalities.by.event[1:5,]
barplot(top.fatalities.by.event$Total / 1000, names.arg = top.fatalities.by.event$Event.Type, cex.names=0.7)
title(ylab = "# of Fatalities (k)")

mtext("Most harmful events to population health (1950-2011)", outer = TRUE, cex = 1.5)

We found that floods have had the greatest economic consequences

top.economic.damage.by.event <- economic.damage.by.event[1:5,]
barplot(top.economic.damage.by.event$Total / 1000000000, names.arg = top.economic.damage.by.event$Event.Type, cex.names=0.7)
title(main = "Events with greatest economic consequences (1950-2011)", ylab = "USD (Billions)")