The goal of this document is to analyze the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and to address the following questions
The analysis starts with the download of the NOAA storm database, which is then processed evaluating the event types reported per each year. This leads to take into consideration only the data starting from 1993, since prior data did report only few event-types, and to re-map some event-types according to the NOAA documentation.
To analyze the economic value of event damages some elaboration is performed on the fields representing the magnitude of event damages economic values.
library(data.table)
library(ggplot2)
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
destfile='stormData.csv.bz2')
We only read the needed fields * Event type ** EVTYPE * event impact economic value ** PROPDMG ** PROPDMGEXP ** CROPDMG ** CROPDMGEXP * event impact on population health ** FATALITIES ** INJURIES * date ** BGN_DATE
colClasses <- c('NULL', 'character', rep('NULL', 5), 'character', rep('NULL', 14), 'numeric', 'numeric', 'numeric', 'character', 'numeric', 'character',rep('NULL', 9))
dt <- read.table('stormData.csv.bz2', sep=',', header=TRUE, stringsAsFactors=FALSE, colClasses=colClasses)
dt <- data.table(dt)
Get the year of each measurement
dt$year <- as.numeric(format(as.Date(dt$BGN_DATE, format = "%m/%d/%Y"), "%Y"))
We take a look at the values in the event-type (EVTYPE) field
countByEventTypeYear <- dt[, .N, by=list(EVTYPE,year)]
countEventTypeByYear <- countByEventTypeYear[, .N, by=year]
countEventTypeByYear
## year N
## 1: 1950 1
## 2: 1951 1
## 3: 1952 1
## 4: 1953 1
## 5: 1954 1
## 6: 1955 3
## 7: 1956 3
## 8: 1957 3
## 9: 1958 3
## 10: 1959 3
## 11: 1960 3
## 12: 1961 3
## 13: 1962 3
## 14: 1963 3
## 15: 1964 3
## 16: 1965 3
## 17: 1966 3
## 18: 1967 3
## 19: 1968 3
## 20: 1969 3
## 21: 1970 3
## 22: 1971 3
## 23: 1972 3
## 24: 1973 3
## 25: 1974 3
## 26: 1975 3
## 27: 1976 3
## 28: 1977 3
## 29: 1978 3
## 30: 1979 3
## 31: 1980 3
## 32: 1981 3
## 33: 1982 3
## 34: 1983 3
## 35: 1984 3
## 36: 1985 3
## 37: 1986 3
## 38: 1987 3
## 39: 1988 3
## 40: 1989 3
## 41: 1990 3
## 42: 1991 3
## 43: 1992 3
## 44: 1995 387
## 45: 1994 267
## 46: 1993 160
## 47: 1996 228
## 48: 1997 170
## 49: 1998 126
## 50: 1999 121
## 51: 2000 112
## 52: 2001 122
## 53: 2002 99
## 54: 2003 51
## 55: 2004 38
## 56: 2005 46
## 57: 2006 50
## 58: 2007 46
## 59: 2008 46
## 60: 2009 46
## 61: 2010 46
## 62: 2011 46
## year N
We see that before 1993 the number of event type per year is very low, that’s beacaus NOAA was registering just a sub-set of the event types before 1993. We’ll be analyzing only the events after 1993, not to get biased values.
dt <- dt[year >= 1993]
According to the NOAA documentation there should be 48 event types. We see that for each year we have much more event types. Most probably there are unconsistent values in the field EVTYPE.
We’ll remap these values for the most significative values present starting from 1993. Let’s here assume that the most significative values of event types are those for which we have most fatalities starting from 1993.
eventsPerHealthImpact <- dt[, .(fatalities = sum(FATALITIES), injuries = sum(INJURIES)), by=list(EVTYPE)]
setkey(eventsPerHealthImpact, fatalities, injuries)
sumHealthEventsFatNum <- rep(0,nrow(eventsPerHealthImpact))
sumHealthEventsFatNum[1] <- eventsPerHealthImpact[nrow(eventsPerHealthImpact),fatalities]
for (i in 2:nrow(eventsPerHealthImpact)) {
sumHealthEventsFatNum[i] <- sumHealthEventsFatNum[i-1] + eventsPerHealthImpact[nrow(eventsPerHealthImpact)-i+1,fatalities]
}
topEventsPerHealthImpactNum <- sum(sumHealthEventsFatNum <= 0.95 * sumHealthEventsFatNum[nrow(eventsPerHealthImpact)])
The list of event types to be re-mapped
significativeEventsPerHealth <- eventsPerHealthImpact[nrow(eventsPerHealthImpact):(nrow(eventsPerHealthImpact)-topEventsPerHealthImpactNum),]
noaaEventTypes <- c('Astronomical Low Tide', 'Avalanche', 'Blizzard', 'Coastal Flood', 'Cold/Wind Chill', 'Debris Flow', 'Dense Fog', 'Dense Smoke', 'Drought', 'Dust Devil', 'Dust Storm', 'Excessive Heat', 'Extreme Cold/Wind Chill', 'Flash Flood', 'Flood', 'Frost/Freeze', 'Funnel Cloud', 'Freezing Fog', 'Hail', 'Heat', 'Heavy Rain', 'Heavy Snow', 'High Surf', 'High Wind', 'Hurricane (Typhoon)', 'Ice Storm', 'Lake-Effect Snow', 'Lakeshore Flood', 'Lightning', 'Marine Hail', 'Marine High Wind', 'Marine Strong Wind', 'Marine Thunderstorm Wind', 'Rip Current', 'Seiche', 'Sleet', 'Storm Surge/Tide', 'Strong Wind', 'Thunderstorm Wind', 'Tornado', 'Tropical Depression', 'Tropical Storm', 'Tsunami', 'Volcanic Ash', 'Waterspout', 'Wildfire', 'Winter Storm', 'Winter Weather')
notMatchedHEventTypesIndex <- tolower(significativeEventsPerHealth$EVTYPE) %in% tolower(noaaEventTypes)
significativeEventsPerHealth[!notMatchedHEventTypesIndex]
## EVTYPE fatalities injuries
## 1: TSTM WIND 241 3631
## 2: RIP CURRENTS 204 297
## 3: HEAT WAVE 172 309
## 4: EXTREME COLD 160 231
## 5: EXTREME HEAT 96 155
## 6: HURRICANE/TYPHOON 64 1275
## 7: THUNDERSTORM WINDS 64 908
## 8: FOG 62 734
## 9: HURRICANE 61 46
## 10: HEAVY SURF/HIGH SURF 42 48
## 11: LANDSLIDE 38 52
## 12: HIGH WINDS 35 302
## 13: COLD 35 48
## 14: UNSEASONABLY WARM AND DRY 29 0
## 15: URBAN/SML STREAM FLD 28 79
## 16: WINTER WEATHER/MIX 28 72
## 17: TORNADOES, TSTM WIND, HAIL 25 0
We remap the EVTYPE column values in the original data
dt[ EVTYPE == "TSTM WIND", EVTYPE := "THUNDERSTORM WIND"]
dt[ EVTYPE == "THUNDERSTORM WINDS", EVTYPE := "THUNDERSTORM WIND"]
dt[ EVTYPE == "EXTREME COLD", EVTYPE := "EXTREME COLD/WIND CHILL"]
dt[ EVTYPE == "EXTREME WINDCHILL", EVTYPE := "EXTREME COLD/WIND CHILL"]
dt[ EVTYPE == "EXCESSIVE HEAT", EVTYPE := "EXTREME HEAT"]
dt[ EVTYPE == "RECORD/EXCESSIVE HEAT", EVTYPE := "EXTREME HEAT"]
dt[ EVTYPE == "RIP CURRENTS", EVTYPE := "RIP CURRENT"]
dt[ EVTYPE == "RIP CURRENTS/HEAVY SURF", EVTYPE := "RIP CURRENT"]
dt[ EVTYPE == "HURRICANE", EVTYPE := "HURRICANE (TYPHOON)"]
dt[ EVTYPE == "HURRICANE ERIN", EVTYPE := "HURRICANE (TYPHOON)"]
dt[ EVTYPE == "HURRICANE/TYPHOON", EVTYPE := "HURRICANE (TYPHOON)"]
dt[ EVTYPE == "FOG", EVTYPE := "DENSE FOG"]
dt[ EVTYPE == "HEAVY SURF/HIGH SURF", EVTYPE := "HEAVY SURF"]
dt[ EVTYPE == "ROUGH SURF", EVTYPE := "HEAVY SURF"]
dt[ EVTYPE == "HIGH WINDS", EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "HIGH WIND/SEAS", EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "HIGH WIND AND SEAS", EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "HIGH WINDS/SNOW", EVTYPE := "HIGH WIND"]
dt[ EVTYPE == "COLD", EVTYPE := "COLD/WIND CHILL"]
dt[ EVTYPE == "COLD AND SNOW", EVTYPE := "COLD/WIND CHILL"]
dt[ EVTYPE == "COLD WEATHER", EVTYPE := "COLD/WIND CHILL"]
dt[ EVTYPE == "UNSEASONABLY WARM AND DRY", EVTYPE := "UNSEASONABLY WARM"]
dt[ EVTYPE == "URBAN/SML STREAM FLD", EVTYPE := "FLOOD"]
dt[ EVTYPE == "URBAN FLOOD", EVTYPE := "FLOOD"]
dt[ EVTYPE == "WINTER WEATHER/MIX", EVTYPE := "WINTER WEATHER"]
dt[ EVTYPE == "TORNADOES, TSTM WIND, HAIL", EVTYPE := "TORNADO"]
dt[ EVTYPE == "STRONG WINDS", EVTYPE := "STRONG WIND"]
dt[ EVTYPE == "WIND", EVTYPE := "STRONG WIND"]
To calculate the ecomomic values of weather events consequences we are going to analyze the values * PROPDMG: property damage * PROPDMGEXP: property damage magnitude * CROPDMG: crop damage magnitude * CROPDMGEXP: crop damage magnitude
Let’s check the values in the PROPDMGEXP column
summary(as.factor(dt$PROPDMGEXP))
## - ? + 0 1 2 3 4 5
## 313139 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 392674 7 8557
similarly for the values in CROPDMGEXP
summary(as.factor(dt$CROPDMGEXP))
## ? 0 2 B k K m M
## 430854 7 19 1 9 21 281832 1 1994
To use these values we have to perform some computations according to the documentation, handling the numeric values in the “*EXP" fields as not having effect on the magnitude of the economic impact
computeExp <- Vectorize(function(key){
if (is.character(key))
key <- toupper(key)
else
key <- ''
switch(key,
B={10^9},
M={10^6},
K={10^3},
H={10^2},
{1})
}, 'key')
dt[,':=' (PROPDMG_VALUE = (PROPDMG*computeExp(PROPDMGEXP)), CROPDMG_VALUE = (CROPDMG*computeExp(CROPDMGEXP))) ]
We evaluate the economic value of these events as a sum of the property and crop damage
dt[ ,DAMAGE_VALUE := PROPDMG_VALUE+CROPDMG_VALUE ]
Now we can calculate the damage associate per each type of event type
eventsEconomicImpact <- dt[, .(EconomicImpact = sum(DAMAGE_VALUE)), by=list(EVTYPE)]
setkey(eventsEconomicImpact, EconomicImpact)
Here we can see which types of events are most harmful with respect to population health
eventsPerHealthImpact$EVTYPE <- factor(eventsPerHealthImpact$EVTYPE, levels = eventsPerHealthImpact$EVTYPE[order(-eventsPerHealthImpact$fatalities)])
eHealth2 <- melt(eventsPerHealthImpact[.N:(.N-10)], id.vars = c('EVTYPE'), measure.vars = c('injuries','fatalities'))
g <- ggplot(data=eHealth2, mapping = aes(x=EVTYPE, y=value, fill=variable))
g <- g + geom_bar(position="dodge",stat="identity")
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g <- g + ggtitle('Events with biggest impact on population health')
g <- g + labs( x='Event Type', y='Thousands of persons involved', fill='Impact type')
g
Here we can see which types of events have the greatest economic consequences
eventsEconomicImpact$EVTYPE <- factor(eventsEconomicImpact$EVTYPE, levels = eventsEconomicImpact$EVTYPE[order(-eventsEconomicImpact$EconomicImpact)])
g <- ggplot(data=eventsEconomicImpact[.N:(.N-10)], mapping = aes(x=EVTYPE, y=EconomicImpact/10^9))
g <- g + geom_col(fill='darkblue')
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g <- g + ggtitle('Events with biggest economic impact')
g <- g + labs( x='Event Type', y='Economic Impact in Bilions of $')
g