Synopsis

In this report we aim to identify most harmful storms respect to population health and storms that has most economic consequences. The U.S. National Oceanic and Atmospheric Administration track different storm events across the United States and provide amount of death and injury within population, and property and crop damage per event. From data we investigated that Tornado has the most damage on population health: highest injury and death from this events. Tornado also has one of the biggest economic consequences, but the greatest consequences are from flood.

Data Processing

The raw data come form U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database for event between 1950 and November 2011. Raw data can be downloaded here. Data was originally downloaded at Jun 18 2014 21:54 (VLAT).

First of all let’s download and load file into R:

    if (!file.exists("data.csv.bz2")) {
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                      "data.csv.bz2", method = "curl")
    }
    if (!file.exists("data.csv")) {
        R.utils::bunzip2("data.csv.bz2", remove = FALSE, overwrite = TRUE)
    }    
    data <- read.csv("data.csv",  na.strings = "")

This is a huge data-set (902297 rows by 37 cols):

    dim(data)
## [1] 902297     37

And we will use data.table package to fasten work with it:

    suppressPackageStartupMessages(require(data.table))
    data <- as.data.table(data)

Also for our investigation we need only this columns:

We’ll do some transformations to data:

    data[, PROPDMGEXP := toupper(PROPDMGEXP)]
    data[is.na(PROPDMGEXP), eco.prop.b := PROPDMG / 1e9]
    data[PROPDMGEXP == "H", eco.prop.b := PROPDMG / 1e7]
    data[PROPDMGEXP == "K", eco.prop.b := PROPDMG / 1e6]
    data[PROPDMGEXP =="M", eco.prop.b := PROPDMG / 1e3]
    data[PROPDMGEXP == "B", eco.prop.b := PROPDMG]
    data[is.na(eco.prop.b), eco.prop.b := PROPDMG / (10 ^ (9 - suppressWarnings(as.numeric(PROPDMGEXP))))]
    data[is.na(eco.prop.b), eco.prop.b := PROPDMG / 1e9]
    
    data[, CROPDMGEXP := toupper(CROPDMGEXP)]
    data[is.na(CROPDMGEXP), eco.crop.b := CROPDMG / 1e9]
    data[CROPDMGEXP == "H", eco.crop.b := CROPDMG / 1e7]
    data[CROPDMGEXP == "K", eco.crop.b := CROPDMG / 1e6]
    data[CROPDMGEXP =="M", eco.crop.b := CROPDMG / 1e3]
    data[CROPDMGEXP == "B", eco.crop.b := CROPDMG]
    data[is.na(eco.crop.b), eco.crop.b := CROPDMG / (10 ^ (9 - suppressWarnings(as.numeric(CROPDMGEXP))))]
    data[is.na(eco.crop.b), eco.crop.b := CROPDMG / 1e9]
    setnames(data, "EVTYPE", "event")
    setnames(data, "FATALITIES", "people.death")
    setnames(data, "INJURIES", "people.injur")
    data <- data[, list(event, people.death, people.injur, eco.prop.b, eco.crop.b)]

We need some transformations on event names: * Some events are in lower case we transform them:

    data[, event := toupper(event)]
    data[, event := gsub("[ ]?AND[ ]?|[ ]?[&][ ]?|[ ][-][ ]", "/", event)]
    data[, event := gsub("^[ ]*|[ ]$", "", event)]
    data[, event := gsub("[ ]+", " ", event)]
    data[, event := gsub("S$", "", event)]

It seems that there are a lot of typos, but it hard to correct, ex:

    data[, .N, by = event][order(event)][82:83, event]

We then aggregate data and remove not meaningful zeros:

    data.agg.eco <- data[, list(eco.prop.b.sum = sum(eco.prop.b, na.rm = TRUE), 
                                eco.crop.b.sum = sum(eco.crop.b, na.rm = TRUE)), 
                         by = event][eco.prop.b.sum > 0 | eco.crop.b.sum > 0]

    data.agg.people <- data[, list(people.death.sum = sum(people.death), 
                                   people.injur.sum = sum(people.injur)),
                            by = event][people.death.sum > 0 | people.injur.sum > 0]

Result

In this section we address events which are most harmful to population health and which has greatest economic consequences.

Most harmful storms with respect to population health

We have two storms effects regarding to population health: * death, that seems more harmful, * and injury.

Top 20 storms with respect to population health:

    head(data.agg.people[order(-people.death.sum, -people.injur.sum)], 20)
##                       event people.death.sum people.injur.sum
##  1:                 TORNADO             5633            91346
##  2:          EXCESSIVE HEAT             1903             6525
##  3:             FLASH FLOOD              980             1777
##  4:                    HEAT              937             2100
##  5:               LIGHTNING              816             5230
##  6:             RIP CURRENT              572              529
##  7:               TSTM WIND              504             6957
##  8:                   FLOOD              470             6789
##  9:               HIGH WIND              283             1439
## 10:               AVALANCHE              224              170
## 11:            WINTER STORM              216             1338
## 12:       THUNDERSTORM WIND              197             2406
## 13:               HEAT WAVE              177              379
## 14:            EXTREME COLD              162              231
## 15:              HEAVY SNOW              127             1021
## 16: EXTREME COLD/WIND CHILL              125               24
## 17:             STRONG WIND              111              301
## 18:               HIGH SURF              104              156
## 19:                BLIZZARD              101              805
## 20:              HEAVY RAIN               98              255

Since we have a lot of events we’ll group 21th, …, 190th:

    data.agg.people.group <- rbind(data.agg.people[order(-people.death.sum, -people.injur.sum)][1:20], 
                                   data.agg.people[order(-people.death.sum, -people.injur.sum)][21:nrow(data.agg.people), list(event = "OTHERS", people.death.sum = sum(people.death.sum), people.injur.sum = sum(people.injur.sum))])
    data.agg.people.group[, event := factor(event, event)]
    suppressPackageStartupMessages(require(reshape2))
    data.agg.people.group <- melt(data.agg.people.group, 
                                  id.vars = "event", 
                                  measure.vars = c("people.death.sum", "people.injur.sum"))
    data.agg.people.group[, variable := factor(variable, labels = c("Death", "Injury"))]
    suppressPackageStartupMessages(require(ggplot2))
    ggplot(data.agg.people.group, aes(x = reorder(event, -xtfrm(event)))) + 
        geom_bar(aes(y = value, fill = variable), 
                 stat = "identity", 
                 position = "dodge") + 
        coord_flip() + 
        scale_y_sqrt() + 
        labs(x = "Strom type", y = "Number of people hit", fill = "") + 
        theme_bw()

plot of chunk unnamed-chunk-15

Figure 1. Most harmful storms with respect to peoples health, x-axis uses square root scale

Storms with the greatest economic consequences

We have two storms effects on economics: * damages to property, * and damages to crop.

Top 20 storms with the greatest economic consequences, in billions of dollars:

    head(data.agg.eco[order(-eco.prop.b.sum - eco.crop.b.sum), 
                      list(event, 
                           eco.prop.b.sum = round(eco.prop.b.sum, 2), 
                           eco.crop.b.sum = round(eco.crop.b.sum, 2))], 20)
##                         event eco.prop.b.sum eco.crop.b.sum
##  1:                     FLOOD         144.66           5.66
##  2:         HURRICANE/TYPHOON          69.31           2.61
##  3:                   TORNADO          56.95           0.41
##  4:               STORM SURGE          43.32           0.00
##  5:                      HAIL          15.74           3.03
##  6:               FLASH FLOOD          16.83           1.42
##  7:                   DROUGHT           1.05          13.97
##  8:                 HURRICANE          11.87           2.74
##  9:               RIVER FLOOD           5.12           5.03
## 10:                 ICE STORM           3.94           5.02
## 11:            TROPICAL STORM           7.70           0.68
## 12:              WINTER STORM           6.69           0.03
## 13:                 HIGH WIND           5.88           0.68
## 14:         THUNDERSTORM WIND           5.43           0.61
## 15:                  WILDFIRE           4.87           0.30
## 16:                 TSTM WIND           4.49           0.55
## 17:          STORM SURGE/TIDE           4.64           0.00
## 18:            HURRICANE OPAL           3.17           0.02
## 19:          WILD/FOREST FIRE           3.00           0.11
## 20: HEAVY RAIN/SEVERE WEATHER           2.50           0.00

Since we have a lot of events we’ll group 21th, …, 362th:

    data.agg.eco.group <- rbind(data.agg.eco[order(-eco.prop.b.sum - eco.crop.b.sum)][1:20], 
                                data.agg.eco[order(-eco.prop.b.sum - eco.crop.b.sum)][21:nrow(data.agg.eco), list(event = "OTHERS", eco.prop.b.sum = sum(eco.prop.b.sum), eco.crop.b.sum = sum(eco.crop.b.sum))])
    data.agg.eco.group[, event := factor(event, event)]
    data.agg.eco.group <- melt(data.agg.eco.group, 
                               id.vars = "event", 
                               measure.vars = c("eco.prop.b.sum", "eco.crop.b.sum"))
    data.agg.eco.group[, variable := factor(variable, labels = c("Crop", "Property"))]
    ggplot(data.agg.eco.group, aes(x = reorder(event, -xtfrm(event)))) + 
        geom_bar(aes(y = value, fill = variable), 
                 stat = "identity", 
                 position = "dodge") + 
        coord_flip() + 
        scale_y_sqrt() + 
        labs(x = "Strom type", y = "Damage in billions of dollars", fill = "") + 
        theme_bw()

plot of chunk unnamed-chunk-18

Figure 2. Most harmful storms with respect to economic, x-axis uses square root scale

Software Environment

This analysis originally performed in a Linux environment:

sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_RU.UTF-8        LC_COLLATE=ru_RU.UTF-8    
##  [5] LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=ru_RU.UTF-8   
##  [7] LC_PAPER=ru_RU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0    reshape2_1.4     data.table_1.9.2
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] grid_3.1.0       gtable_0.1.2     htmltools_0.2.4  knitr_1.6       
##  [9] labeling_0.2     MASS_7.3-33      munsell_0.4.2    plyr_1.8.1      
## [13] proto_0.3-10     Rcpp_0.11.1      rmarkdown_0.2.46 scales_0.2.4    
## [17] stringr_0.6.2    tools_3.1.0      yaml_2.1.11

We used some additional packages (all available from CRAN):