Analysis of weather events in USA, and the impact on population health & economy

The analysis explore the NOAA storm database in attempt to understand the damage caused by severe weather events, both to human lives as well as the economic impacts. It is presented in such away that will facilitate the reader to understand how to prioritize resources to remediate some of the damages.

Synopsis

TORNADO catastrophes in central and eastern time zones regions of the US were the most harmful events, with approximate casualties of 90,000 people, out of them 5% are dead. Following are the FLOODS in central time zone, with 6,675 casualties, the third event relate to EXCESSIVE HEAT that causes 5,229 casualties, were 20% are dead. FLOODS that occur in the pacific time zone are the most costly events with accumulated economic damage of 119 billions dollars, were the majority of damage is to property. Next are the TORNADOS and HURRICANES happened in central time zone, with accumulated economic damage of 46 and 43 billions dollars respectively, with a majority of damage to property. We can conclude that the TORNADO is the most harmful and damaging event that ever happen in the US following by the FLOOD.

Questions

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data

Storm Data 47Mb

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tidyr)
library(ggplot2)
library(knitr)

# download the data  
    datafile <- "repdata-data-StormData.csv"
    
    if (!file.exists(paste(datafile, ".bz2", sep = ""))) {
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
            destfile = paste(datafile, ".bz2", sep = ""), method = "curl")
    }

## Warning: running command 'curl
## "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## -o "repdata-data-StormData.csv.bz2"' had status 127

## Warning in
## download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
## : download had nonzero exit status

# unzip and load the data   
    if (!file.exists(datafile)) {
        unzip(paste(datafile, ".bz2", sep = ""))
    }
    
    data <- read.csv(datafile)

The measured data has 902,297 observation of 37 variables.

Data processing - population health.

# subset the relevant features 
    fatal <- select(data, EVTYPE, TIME_ZONE, FATALITIES, INJURIES)

# get summary of features
    summary(fatal)

##                EVTYPE         TIME_ZONE        FATALITIES      
##  HAIL             :288661   CST    :547493   Min.   :  0.0000  
##  TSTM WIND        :219940   EST    :245558   1st Qu.:  0.0000  
##  THUNDERSTORM WIND: 82563   MST    : 68390   Median :  0.0000  
##  TORNADO          : 60652   PST    : 28302   Mean   :  0.0168  
##  FLASH FLOOD      : 54277   AST    :  6360   3rd Qu.:  0.0000  
##  FLOOD            : 25326   HST    :  2563   Max.   :583.0000  
##  (Other)          :170878   (Other):  3631                     
##     INJURIES        
##  Min.   :   0.0000  
##  1st Qu.:   0.0000  
##  Median :   0.0000  
##  Mean   :   0.1557  
##  3rd Qu.:   0.0000  
##  Max.   :1700.0000  
##

# filter out observations without fatalities or injuries
    fatal <- filter(fatal, FATALITIES!=0 | INJURIES!=0)

# summarize the number of fatalities and injuries per event type and time zone
    fs <- group_by(fatal, EVTYPE, TIME_ZONE)
    fs <- summarise(fs, fatalities = sum(FATALITIES, na.rm = TRUE),
                       injuries = sum(INJURIES, na.rm = TRUE),
                       total = fatalities + injuries,
                       fatalities_pct = paste(round(fatalities/total*100, digits=0),"%", sep=""),
                       injuries_pct = paste(round(injuries/total*100, digits=0),"%", sep=""))

# order by total casualties and extract the 10 most harmful events
    fs <- head(fs[order(desc(fs$total)), ] , 10)

# reshape the data in order to produce a side by side column bar graph 
    fr <-
        select(fs,EVTYPE,TIME_ZONE, fatalities, injuries, total ) %>%
            gather(impact, affected_lifes, fatalities:injuries)
    fr <- arrange(fr, desc(total))

# ajust the feature names
    names(fs) <- c("event","time_zone","fatalities","injuries","total","fatalities_%","injuries_%")

Data processing - economic impact.

# subset the relevant features
    damage <- select(data, EVTYPE, TIME_ZONE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# get summary
    summary(damage)

##                EVTYPE         TIME_ZONE         PROPDMG       
##  HAIL             :288661   CST    :547493   Min.   :   0.00  
##  TSTM WIND        :219940   EST    :245558   1st Qu.:   0.00  
##  THUNDERSTORM WIND: 82563   MST    : 68390   Median :   0.00  
##  TORNADO          : 60652   PST    : 28302   Mean   :  12.06  
##  FLASH FLOOD      : 54277   AST    :  6360   3rd Qu.:   0.50  
##  FLOOD            : 25326   HST    :  2563   Max.   :5000.00  
##  (Other)          :170878   (Other):  3631                    
##    PROPDMGEXP        CROPDMG          CROPDMGEXP    
##         :465934   Min.   :  0.000          :618413  
##  K      :424665   1st Qu.:  0.000   K      :281832  
##  M      : 11330   Median :  0.000   M      :  1994  
##  0      :   216   Mean   :  1.527   k      :    21  
##  B      :    40   3rd Qu.:  0.000   0      :    19  
##  5      :    28   Max.   :990.000   B      :     9  
##  (Other):    84                     (Other):     9

# filter out observations without damage
    damage <- filter(damage, PROPDMG!=0 && PROPDMGEXP %in% c("K","M", "B") | CROPDMG!=0 && CROPDMGEXP %in% c("K","M", "B"))

# scalling numbers
    scallingFactor <- function(damage, exponent) {
        if (exponent %in% c("b", "B")) {
            return(damage * 1000000000)
        }
        if (exponent %in% c("k", "K")) {
            return(damage * 1000)
        }
        if (exponent %in% c("m", "M")) {
            return(damage * 1000000)
        }
            return(0)
    }

# append scaled numbers 
    damage$prop_damage_scalling <-  mapply(scallingFactor, damage$PROPDMG, damage$PROPDMGEXP)
    damage$crop_damage_scalling <-  mapply(scallingFactor, damage$CROPDMG, damage$CROPDMGEXP)

# summarize the damage per event type and time zone
    dmg <- group_by(damage, EVTYPE, TIME_ZONE)
    dmg <- summarise(dmg, prop_damage = sum(prop_damage_scalling, na.rm = TRUE)/1000000000,
                     crop_damage = sum(crop_damage_scalling, na.rm = TRUE)/1000000000,
                     total = round(prop_damage + crop_damage, digits=2))

# order by highest damage and extract the 10 most damaging events
    dmg <- head(dmg[order(desc(dmg$total)), ] , 10)

# reshape the data in order to produce a side by side column bar graph 
    dmr <-
        select(dmg,EVTYPE,TIME_ZONE, prop_damage, crop_damage, total ) %>%
            gather(impact, damage_value, prop_damage:crop_damage)
    dmr <- arrange(dmr, desc(total))

# ajust the feature names
    names(dmg) <- c("event","time_zone","property_damage","crop_damage","total")

Results

population health

Table 1 - The most harmful events

kable(fs, digits=0)

event	time_zone	fatalities	injuries	total	fatalities_%	injuries_%
TORNADO	CST	5180	81524	86704	6%	94%
TORNADO	EST	440	9253	9693	5%	95%
FLOOD	CST	158	6517	6675	2%	98%
EXCESSIVE HEAT	CST	1040	4189	5229	20%	80%
TSTM WIND	CST	334	4755	5089	7%	93%
LIGHTNING	EST	389	3138	3527	11%	89%
EXCESSIVE HEAT	EST	678	2076	2754	25%	75%
HEAT	CST	800	1686	2486	32%	68%
ICE STORM	EST	12	1737	1749	1%	99%
TSTM WIND	EST	140	1604	1744	8%	92%

Chart 1 - Visulaize the most harmful events per time zone

ggplot(data=fr, aes(x=reorder(EVTYPE, order(total)), y=affected_lifes, fill=impact)) +
geom_bar(stat = "identity", position=position_dodge()) + 
coord_flip() + xlab("Event") + ylab("Casualties") + facet_wrap(~TIME_ZONE)

economic impact

Table 2 - The most dameging events (billions $)

kable(dmg, digits=0)

event	time_zone	property_damage	crop_damage	total
FLOOD	PST	118	1	119
TORNADO	CST	48	0	48
HURRICANE/TYPHOON	CST	46	2	48
STORM SURGE	CST	43	0	43
HURRICANE/TYPHOON	EST	23	1	24
FLOOD	CST	15	3	18
DROUGHT	CST	1	12	13
FLOOD	EST	11	2	12
HAIL	CST	9	2	11
RIVER FLOOD	CST	5	5	10

Chart 2 - Visulaize the most dameging events per time zone

ggplot(data=dmr, aes(x=reorder(EVTYPE, total), y=damage_value, fill=impact)) + 
geom_bar(stat="identity", position=position_dodge()) +
coord_flip() + xlab("Event") + ylab("Economic damage - billions $") + facet_wrap(~TIME_ZONE)