Synopsis

The goal of this document is to analyze the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and to address the following questions

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The analysis starts with the download of the NOAA storm database, which is then processed evaluating the event types reported per each year. This leads to take into consideration only the data starting from 1993, since prior data did report only few event-types, and to re-map some event-types according to the NOAA documentation.

To analyze the economic value of event damages some elaboration is performed on the fields representing the magnitude of event damages economic values.

Data Processing

Loading libraries

library(data.table)
library(ggplot2)

Download the data

download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
    destfile='stormData.csv.bz2')

Read the data

We only read the needed fields * Event type ** EVTYPE * event impact economic value ** PROPDMG ** PROPDMGEXP ** CROPDMG ** CROPDMGEXP * event impact on population health ** FATALITIES ** INJURIES * date ** BGN_DATE

colClasses <- c('NULL', 'character', rep('NULL', 5), 'character', rep('NULL', 14), 'numeric', 'numeric', 'numeric', 'character', 'numeric', 'character',rep('NULL', 9))
dt <- read.table('stormData.csv.bz2', sep=',', header=TRUE, stringsAsFactors=FALSE, colClasses=colClasses)
dt <- data.table(dt)

Get the year of each measurement

dt$year <- as.numeric(format(as.Date(dt$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

Data cleaning on EVTYPE

We take a look at the values in the event-type (EVTYPE) field

countByEventTypeYear <- dt[, .N, by=list(EVTYPE,year)]
countEventTypeByYear <- countByEventTypeYear[, .N, by=year]
countEventTypeByYear
##     year   N
##  1: 1950   1
##  2: 1951   1
##  3: 1952   1
##  4: 1953   1
##  5: 1954   1
##  6: 1955   3
##  7: 1956   3
##  8: 1957   3
##  9: 1958   3
## 10: 1959   3
## 11: 1960   3
## 12: 1961   3
## 13: 1962   3
## 14: 1963   3
## 15: 1964   3
## 16: 1965   3
## 17: 1966   3
## 18: 1967   3
## 19: 1968   3
## 20: 1969   3
## 21: 1970   3
## 22: 1971   3
## 23: 1972   3
## 24: 1973   3
## 25: 1974   3
## 26: 1975   3
## 27: 1976   3
## 28: 1977   3
## 29: 1978   3
## 30: 1979   3
## 31: 1980   3
## 32: 1981   3
## 33: 1982   3
## 34: 1983   3
## 35: 1984   3
## 36: 1985   3
## 37: 1986   3
## 38: 1987   3
## 39: 1988   3
## 40: 1989   3
## 41: 1990   3
## 42: 1991   3
## 43: 1992   3
## 44: 1995 387
## 45: 1994 267
## 46: 1993 160
## 47: 1996 228
## 48: 1997 170
## 49: 1998 126
## 50: 1999 121
## 51: 2000 112
## 52: 2001 122
## 53: 2002  99
## 54: 2003  51
## 55: 2004  38
## 56: 2005  46
## 57: 2006  50
## 58: 2007  46
## 59: 2008  46
## 60: 2009  46
## 61: 2010  46
## 62: 2011  46
##     year   N

We see that before 1993 the number of event type per year is very low, that’s beacaus NOAA was registering just a sub-set of the event types before 1993. We’ll be analyzing only the events after 1993, not to get biased values.

dt <- dt[year >= 1993]

According to the NOAA documentation there should be 48 event types. We see that for each year we have much more event types. Most probably there are unconsistent values in the field EVTYPE.

We’ll remap these values for the most significative values present starting from 1993. Let’s here assume that the most significative values of event types are those for which we have most fatalities starting from 1993.

eventsPerHealthImpact <- dt[, .(fatalities = sum(FATALITIES), injuries = sum(INJURIES)), by=list(EVTYPE)]
setkey(eventsPerHealthImpact, fatalities, injuries)
sumHealthEventsFatNum <- rep(0,nrow(eventsPerHealthImpact))
sumHealthEventsFatNum[1] <- eventsPerHealthImpact[nrow(eventsPerHealthImpact),fatalities]
for (i in 2:nrow(eventsPerHealthImpact)) {
    sumHealthEventsFatNum[i] <- sumHealthEventsFatNum[i-1] + eventsPerHealthImpact[nrow(eventsPerHealthImpact)-i+1,fatalities]
}
topEventsPerHealthImpactNum <- sum(sumHealthEventsFatNum <= 0.95 * sumHealthEventsFatNum[nrow(eventsPerHealthImpact)])

The list of event types to be re-mapped

significativeEventsPerHealth <- eventsPerHealthImpact[nrow(eventsPerHealthImpact):(nrow(eventsPerHealthImpact)-topEventsPerHealthImpactNum),]

noaaEventTypes <- c('Astronomical Low Tide', 'Avalanche', 'Blizzard', 'Coastal Flood', 'Cold/Wind Chill', 'Debris Flow', 'Dense Fog', 'Dense Smoke', 'Drought', 'Dust Devil', 'Dust Storm', 'Excessive Heat', 'Extreme Cold/Wind Chill', 'Flash Flood', 'Flood', 'Frost/Freeze', 'Funnel Cloud', 'Freezing Fog', 'Hail', 'Heat', 'Heavy Rain', 'Heavy Snow', 'High Surf', 'High Wind', 'Hurricane (Typhoon)', 'Ice Storm', 'Lake-Effect Snow', 'Lakeshore Flood', 'Lightning', 'Marine Hail', 'Marine High Wind', 'Marine Strong Wind', 'Marine Thunderstorm Wind', 'Rip Current', 'Seiche', 'Sleet', 'Storm Surge/Tide', 'Strong Wind', 'Thunderstorm Wind', 'Tornado', 'Tropical Depression', 'Tropical Storm', 'Tsunami', 'Volcanic Ash', 'Waterspout', 'Wildfire', 'Winter Storm', 'Winter Weather')

notMatchedHEventTypesIndex <- tolower(significativeEventsPerHealth$EVTYPE) %in% tolower(noaaEventTypes)
significativeEventsPerHealth[!notMatchedHEventTypesIndex]
##                         EVTYPE fatalities injuries
##  1:                  TSTM WIND        241     3631
##  2:               RIP CURRENTS        204      297
##  3:                  HEAT WAVE        172      309
##  4:               EXTREME COLD        160      231
##  5:               EXTREME HEAT         96      155
##  6:          HURRICANE/TYPHOON         64     1275
##  7:         THUNDERSTORM WINDS         64      908
##  8:                        FOG         62      734
##  9:                  HURRICANE         61       46
## 10:       HEAVY SURF/HIGH SURF         42       48
## 11:                  LANDSLIDE         38       52
## 12:                 HIGH WINDS         35      302
## 13:                       COLD         35       48
## 14:  UNSEASONABLY WARM AND DRY         29        0
## 15:       URBAN/SML STREAM FLD         28       79
## 16:         WINTER WEATHER/MIX         28       72
## 17: TORNADOES, TSTM WIND, HAIL         25        0

We remap the EVTYPE column values in the original data

dt[ EVTYPE == "TSTM WIND", EVTYPE := "THUNDERSTORM WIND"]
dt[ EVTYPE == "THUNDERSTORM WINDS",  EVTYPE := "THUNDERSTORM WIND"]

dt[ EVTYPE == "EXTREME COLD",  EVTYPE := "EXTREME COLD/WIND CHILL"]
dt[ EVTYPE == "EXTREME WINDCHILL",  EVTYPE := "EXTREME COLD/WIND CHILL"]    

dt[ EVTYPE == "EXCESSIVE HEAT",  EVTYPE := "EXTREME HEAT"]
dt[ EVTYPE == "RECORD/EXCESSIVE HEAT",  EVTYPE := "EXTREME HEAT"]

dt[ EVTYPE == "RIP CURRENTS",  EVTYPE := "RIP CURRENT"]
dt[ EVTYPE == "RIP CURRENTS/HEAVY SURF",  EVTYPE := "RIP CURRENT"]


dt[ EVTYPE == "HURRICANE",  EVTYPE := "HURRICANE (TYPHOON)"]
dt[ EVTYPE == "HURRICANE ERIN",  EVTYPE := "HURRICANE (TYPHOON)"]
dt[ EVTYPE == "HURRICANE/TYPHOON",  EVTYPE := "HURRICANE (TYPHOON)"]

dt[ EVTYPE == "FOG",  EVTYPE := "DENSE FOG"]

dt[ EVTYPE == "HEAVY SURF/HIGH SURF",  EVTYPE := "HEAVY SURF"]
dt[ EVTYPE == "ROUGH SURF",  EVTYPE := "HEAVY SURF"]

dt[ EVTYPE == "HIGH WINDS",  EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "HIGH WIND/SEAS",  EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "HIGH WIND AND SEAS",  EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "HIGH WINDS/SNOW",  EVTYPE := "HIGH WIND"]

dt[ EVTYPE == "COLD",  EVTYPE := "COLD/WIND CHILL"]
dt[ EVTYPE == "COLD AND SNOW",  EVTYPE := "COLD/WIND CHILL"]
dt[ EVTYPE == "COLD WEATHER",  EVTYPE := "COLD/WIND CHILL"]

dt[ EVTYPE == "UNSEASONABLY WARM AND DRY",  EVTYPE := "UNSEASONABLY WARM"]

dt[ EVTYPE == "URBAN/SML STREAM FLD",  EVTYPE := "FLOOD"]
dt[ EVTYPE == "URBAN FLOOD",  EVTYPE := "FLOOD"]

dt[ EVTYPE == "WINTER WEATHER/MIX",  EVTYPE := "WINTER WEATHER"]

dt[ EVTYPE == "TORNADOES, TSTM WIND, HAIL",  EVTYPE := "TORNADO"]

dt[ EVTYPE == "STRONG WINDS",  EVTYPE := "STRONG WIND"]
dt[ EVTYPE == "WIND",  EVTYPE := "STRONG WIND"]

Calculate impact on health and economy

To calculate the ecomomic values of weather events consequences we are going to analyze the values * PROPDMG: property damage * PROPDMGEXP: property damage magnitude * CROPDMG: crop damage magnitude * CROPDMGEXP: crop damage magnitude

Let’s check the values in the PROPDMGEXP column

summary(as.factor(dt$PROPDMGEXP))
##             -      ?      +      0      1      2      3      4      5 
## 313139      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 392674      7   8557

similarly for the values in CROPDMGEXP

summary(as.factor(dt$CROPDMGEXP))
##             ?      0      2      B      k      K      m      M 
## 430854      7     19      1      9     21 281832      1   1994

To use these values we have to perform some computations according to the documentation, handling the numeric values in the “*EXP" fields as not having effect on the magnitude of the economic impact

computeExp <- Vectorize(function(key){
  if (is.character(key)) 
    key <- toupper(key)
  else
    key <- ''
  switch(key,
    B={10^9},
    M={10^6},
    K={10^3},
    H={10^2},
    {1})
}, 'key')
dt[,':=' (PROPDMG_VALUE = (PROPDMG*computeExp(PROPDMGEXP)), CROPDMG_VALUE = (CROPDMG*computeExp(CROPDMGEXP))) ]

We evaluate the economic value of these events as a sum of the property and crop damage

dt[ ,DAMAGE_VALUE := PROPDMG_VALUE+CROPDMG_VALUE ]

Now we can calculate the damage associate per each type of event type

eventsEconomicImpact <- dt[, .(EconomicImpact = sum(DAMAGE_VALUE)), by=list(EVTYPE)]
setkey(eventsEconomicImpact, EconomicImpact)

Results

Here we can see which types of events are most harmful with respect to population health

eventsPerHealthImpact$EVTYPE <- factor(eventsPerHealthImpact$EVTYPE, levels = eventsPerHealthImpact$EVTYPE[order(-eventsPerHealthImpact$fatalities)])
eHealth2 <- melt(eventsPerHealthImpact[.N:(.N-10)], id.vars = c('EVTYPE'), measure.vars = c('injuries','fatalities'))
g <- ggplot(data=eHealth2, mapping = aes(x=EVTYPE, y=value, fill=variable))
g <- g + geom_bar(position="dodge",stat="identity") 
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g <- g + ggtitle('Events with biggest impact on population health')
g <- g + labs( x='Event Type',  y='Thousands of persons involved', fill='Impact type')
g

Here we can see which types of events have the greatest economic consequences

eventsEconomicImpact$EVTYPE <- factor(eventsEconomicImpact$EVTYPE, levels = eventsEconomicImpact$EVTYPE[order(-eventsEconomicImpact$EconomicImpact)])

g <- ggplot(data=eventsEconomicImpact[.N:(.N-10)], mapping = aes(x=EVTYPE, y=EconomicImpact/10^9))
g <- g + geom_col(fill='darkblue')
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g <- g + ggtitle('Events with biggest economic impact')
g <- g + labs( x='Event Type',  y='Economic Impact in Bilions of $')
g