This analysis explores the NOAA storm database to understand the damage caused by severe weather events, both to human life and to the economy. It is presented in a way that helps the reader understand how to prioritize resources to remediate some of the damage.

Synopsis

TORNADO events in the central and eastern time zones of the US were the most harmful, with approximately 90,000 casualties, about 5% of them fatalities. Next are FLOODS in the central time zone, with 6,675 casualties, followed by EXCESSIVE HEAT, which caused 5,229 casualties, 20% of them fatalities. FLOODS in the Pacific time zone are the most costly events, with accumulated economic damage of 119 billion dollars, most of it to property. Next are TORNADOS and HURRICANES in the central time zone, with accumulated economic damage of 46 and 43 billion dollars respectively, again mostly to property. We conclude that TORNADO is the most harmful and damaging event type recorded in the US, followed by FLOOD.

Questions

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data

Storm Data 47Mb

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
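
The uneven coverage over time can be checked by counting records per year. The following is a minimal sketch on toy data, assuming the begin date is stored as text in month/day/year form (as in the database's BGN_DATE column); the vector `bgn_date` is a hypothetical stand-in, not the real data.

```r
# Toy stand-in for the BGN_DATE column (assumed format: month/day/year time)
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "5/3/1999 0:00:00", "5/3/1999 0:00:00", "11/28/2011 0:00:00")

# Parse the date part (trailing time text is ignored) and extract the year
event_year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Count events per year
events_per_year <- table(event_year)
print(events_per_year)
```

Applied to the full data set, a table like this would show how the number of recorded events grows in later years.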

Data Processing

library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tidyr)
library(ggplot2)
library(knitr)
# download the data  
    datafile <- "repdata-data-StormData.csv"
    
    if (!file.exists(paste(datafile, ".bz2", sep = ""))) {
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
            destfile = paste(datafile, ".bz2", sep = ""), method = "curl")
    }
## Warning: running command 'curl
## "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## -o "repdata-data-StormData.csv.bz2"' had status 127
## Warning in
## download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
## : download had nonzero exit status
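# load the data; read.csv reads the bzip2-compressed file directly
# (unzip() only handles .zip archives, so it cannot extract a .bz2 file)
# stringsAsFactors = TRUE keeps EVTYPE etc. as factors, matching the summaries below
    data <- read.csv(paste(datafile, ".bz2", sep = ""), stringsAsFactors = TRUE)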

The data set has 902,297 observations of 37 variables.
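
As a side note on the loading step, base R can read a bzip2-compressed CSV without extracting it first. The following is a minimal sketch using a hypothetical temporary file and toy rows, not the storm data itself.

```r
# Write a tiny CSV through a bzip2 connection (hypothetical temp file)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "FLOOD,1"), con)
close(con)

# read.csv detects the compression and decompresses while reading
toy <- read.csv(tmp)
print(dim(toy))
```

The same call pattern should work when pointed at the downloaded `.bz2` file, at the cost of decompressing it on every read.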

Data processing - population health.

# subset the relevant features 
    fatal <- select(data, EVTYPE, TIME_ZONE, FATALITIES, INJURIES)

# get summary of features
    summary(fatal)
##                EVTYPE         TIME_ZONE        FATALITIES      
##  HAIL             :288661   CST    :547493   Min.   :  0.0000  
##  TSTM WIND        :219940   EST    :245558   1st Qu.:  0.0000  
##  THUNDERSTORM WIND: 82563   MST    : 68390   Median :  0.0000  
##  TORNADO          : 60652   PST    : 28302   Mean   :  0.0168  
##  FLASH FLOOD      : 54277   AST    :  6360   3rd Qu.:  0.0000  
##  FLOOD            : 25326   HST    :  2563   Max.   :583.0000  
##  (Other)          :170878   (Other):  3631                     
##     INJURIES        
##  Min.   :   0.0000  
##  1st Qu.:   0.0000  
##  Median :   0.0000  
##  Mean   :   0.1557  
##  3rd Qu.:   0.0000  
##  Max.   :1700.0000  
## 
# filter out observations without fatalities or injuries
    fatal <- filter(fatal, FATALITIES!=0 | INJURIES!=0)

# summarize the number of fatalities and injuries per event type and time zone
    fs <- group_by(fatal, EVTYPE, TIME_ZONE)
    fs <- summarise(fs, fatalities = sum(FATALITIES, na.rm = TRUE),
                       injuries = sum(INJURIES, na.rm = TRUE),
                       total = fatalities + injuries,
                       fatalities_pct = paste(round(fatalities/total*100, digits=0),"%", sep=""),
                       injuries_pct = paste(round(injuries/total*100, digits=0),"%", sep=""))

# order by total casualties and extract the 10 most harmful events
    fs <- head(fs[order(desc(fs$total)), ] , 10)

# reshape the data in order to produce a side by side column bar graph 
    fr <-
        select(fs,EVTYPE,TIME_ZONE, fatalities, injuries, total ) %>%
            gather(impact, affected_lifes, fatalities:injuries)
    fr <- arrange(fr, desc(total))

# adjust the feature names
    names(fs) <- c("event","time_zone","fatalities","injuries","total","fatalities_%","injuries_%")
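
To make the aggregation above easier to check by hand, here is a minimal sketch of the same idea on toy data. It uses base R's `aggregate` instead of dplyr so that it has no package dependencies; the data frame `toy` and its values are invented for illustration.

```r
# Toy records: two tornado rows in the same time zone, plus a flood and a heat event
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  TIME_ZONE  = c("CST", "CST", "CST", "EST"),
  FATALITIES = c(1, 2, 0, 4),
  INJURIES   = c(10, 7, 5, 4)
)

# Sum fatalities and injuries per event type and time zone
agg <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE + TIME_ZONE,
                 data = toy, FUN = sum)

# Total casualties and the share of them that were fatal
agg$total <- agg$FATALITIES + agg$INJURIES
agg$fatalities_pct <- round(agg$FATALITIES / agg$total * 100)

# Order by total casualties, most harmful first
agg <- agg[order(-agg$total), ]
print(agg)
```

The two tornado rows collapse into one group, which is the same per-event, per-time-zone grouping used for Table 1.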

Data processing - economic impact.

# subset the relevant features
    damage <- select(data, EVTYPE, TIME_ZONE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# get summary
    summary(damage)
##                EVTYPE         TIME_ZONE         PROPDMG       
##  HAIL             :288661   CST    :547493   Min.   :   0.00  
##  TSTM WIND        :219940   EST    :245558   1st Qu.:   0.00  
##  THUNDERSTORM WIND: 82563   MST    : 68390   Median :   0.00  
##  TORNADO          : 60652   PST    : 28302   Mean   :  12.06  
##  FLASH FLOOD      : 54277   AST    :  6360   3rd Qu.:   0.50  
##  FLOOD            : 25326   HST    :  2563   Max.   :5000.00  
##  (Other)          :170878   (Other):  3631                    
##    PROPDMGEXP        CROPDMG          CROPDMGEXP    
##         :465934   Min.   :  0.000          :618413  
##  K      :424665   1st Qu.:  0.000   K      :281832  
##  M      : 11330   Median :  0.000   M      :  1994  
##  0      :   216   Mean   :  1.527   k      :    21  
##  B      :    40   3rd Qu.:  0.000   0      :    19  
##  5      :    28   Max.   :990.000   B      :     9  
##  (Other):    84                     (Other):     9
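# filter out observations without damage
# (use the vectorized & rather than the scalar &&, and match exponents case-insensitively)
    damage <- filter(damage, (PROPDMG != 0 & toupper(PROPDMGEXP) %in% c("K", "M", "B")) |
                             (CROPDMG != 0 & toupper(CROPDMGEXP) %in% c("K", "M", "B")))

# scaling numbers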
    scallingFactor <- function(damage, exponent) {
        if (exponent %in% c("b", "B")) {
            return(damage * 1000000000)
        }
        if (exponent %in% c("k", "K")) {
            return(damage * 1000)
        }
        if (exponent %in% c("m", "M")) {
            return(damage * 1000000)
        }
            return(0)
    }

# append scaled numbers 
    damage$prop_damage_scalling <-  mapply(scallingFactor, damage$PROPDMG, damage$PROPDMGEXP)
    damage$crop_damage_scalling <-  mapply(scallingFactor, damage$CROPDMG, damage$CROPDMGEXP)

# summarize the damage per event type and time zone
    dmg <- group_by(damage, EVTYPE, TIME_ZONE)
    dmg <- summarise(dmg, prop_damage = sum(prop_damage_scalling, na.rm = TRUE)/1000000000,
                     crop_damage = sum(crop_damage_scalling, na.rm = TRUE)/1000000000,
                     total = round(prop_damage + crop_damage, digits=2))

# order by highest damage and extract the 10 most damaging events
    dmg <- head(dmg[order(desc(dmg$total)), ] , 10)

# reshape the data in order to produce a side by side column bar graph 
    dmr <-
        select(dmg,EVTYPE,TIME_ZONE, prop_damage, crop_damage, total ) %>%
            gather(impact, damage_value, prop_damage:crop_damage)
    dmr <- arrange(dmr, desc(total))

# adjust the feature names
    names(dmg) <- c("event","time_zone","property_damage","crop_damage","total")
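
The exponent scaling can also be written as a vectorized lookup, which avoids calling a function once per row. This is a sketch of an alternative, not the method used above; `multiplier` and `scale_damage` are hypothetical names, and, like the original function, any code other than K, M or B (in either case) is scaled to 0.

```r
# Multipliers for the exponent codes handled in the text (K, M, B)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

scale_damage <- function(value, exponent) {
  # Look up each code case-insensitively; unknown codes give NA
  m <- unname(multiplier[toupper(as.character(exponent))])
  # Treat unknown codes as zero damage, as the row-wise function does
  ifelse(is.na(m), 0, value * m)
}

scaled <- scale_damage(c(25, 2.5, 1, 7), c("K", "m", "B", "5"))
print(scaled)
```

Because the lookup works on whole columns, it could be applied directly to `PROPDMG`/`PROPDMGEXP` and `CROPDMG`/`CROPDMGEXP` in place of `mapply`.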

Results

Population health

Table 1 - The most harmful events

kable(fs, digits=0)
|event          |time_zone | fatalities| injuries| total|fatalities_% |injuries_% |
|:--------------|:---------|----------:|--------:|-----:|:------------|:----------|
|TORNADO        |CST       |       5180|    81524| 86704|6%           |94%        |
|TORNADO        |EST       |        440|     9253|  9693|5%           |95%        |
|FLOOD          |CST       |        158|     6517|  6675|2%           |98%        |
|EXCESSIVE HEAT |CST       |       1040|     4189|  5229|20%          |80%        |
|TSTM WIND      |CST       |        334|     4755|  5089|7%           |93%        |
|LIGHTNING      |EST       |        389|     3138|  3527|11%          |89%        |
|EXCESSIVE HEAT |EST       |        678|     2076|  2754|25%          |75%        |
|HEAT           |CST       |        800|     1686|  2486|32%          |68%        |
|ICE STORM      |EST       |         12|     1737|  1749|1%           |99%        |
|TSTM WIND      |EST       |        140|     1604|  1744|8%           |92%        |

Chart 1 - Visualize the most harmful events per time zone

ggplot(data=fr, aes(x=reorder(EVTYPE, total), y=affected_lifes, fill=impact)) +
geom_bar(stat = "identity", position=position_dodge()) + 
coord_flip() + xlab("Event") + ylab("Casualties") + facet_wrap(~TIME_ZONE) 

Economic impact

Table 2 - The most damaging events (billions $)

kable(dmg, digits=0)
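|event             |time_zone | property_damage| crop_damage| total|
|:-----------------|:---------|---------------:|-----------:|-----:|
|FLOOD             |PST       |             118|           1|   119|
|TORNADO           |CST       |              48|           0|    48|
|HURRICANE/TYPHOON |CST       |              46|           2|    48|
|STORM SURGE       |CST       |              43|           0|    43|
|HURRICANE/TYPHOON |EST       |              23|           1|    24|
|FLOOD             |CST       |              15|           3|    18|
|DROUGHT           |CST       |               1|          12|    13|
|FLOOD             |EST       |              11|           2|    12|
|HAIL              |CST       |               9|           2|    11|
|RIVER FLOOD       |CST       |               5|           5|    10|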

Chart 2 - Visualize the most damaging events per time zone

ggplot(data=dmr, aes(x=reorder(EVTYPE, total), y=damage_value, fill=impact)) + 
geom_bar(stat="identity", position=position_dodge()) +
coord_flip() + xlab("Event") + ylab("Economic damage - billions $") + facet_wrap(~TIME_ZONE)