Human and Economic Damage caused by Storms in the U.S.

Synopsis

We sift through the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database in look for insights as to which type of weather events are more harmful in terms of loss of life, injuries, and damage to crops or property. We find that, by a large margin, tornados cause the most fatalities and injuries. Crops are most affected by drought, although floods (including river floods) do not lag far behind in havoc-wreaking power. Floods are also the greatest cause of property damage, accounting for more than twice as much losses as the second worst event, hurricane/typhoon. Our results are based on publicly available data from the period 1950-2011, including recorded events for all U.S. counties. All data processing was done in R, a free programming language and software environment for statistical computing and graphics.

Data Processing

In this section we document in detail every data manipulation we performed in order to arrive at our conclusions.

We begin by loading some useful packages, setting the working directory, and reading the data into R (we use a mild overestimate for nrow in order to speed up the process without assuming that we know beforehand the exact number of rows):

library("dplyr")
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("stringr")
library("tidyr")
library("ggplot2")
library("RColorBrewer")
setwd("/home/eduardo/datascience/coursera/5-repdata/PA2")
storms <- read.csv("repdata-data-StormData.csv.bz2", nrow = 903000)

The data set includes 902297 observations of 37 variables:

dim(storms)
## [1] 902297     37
names(storms)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

We'll process these data using functions from the dplyr package. First we select only those variables that we're interested in, which are:

There are some weird prefixes included in the data, different from K, M, or B (or nothing). Since we have no way of knowing what is meant by them, we regard them as NAs and eliminate those observations from the processed data.

The pref2num() function, defined below, takes a vector of (character) prefixes and returns a numeric vector, where each number corresponds to the value of the prefix. It recognizes K, M, and B (uppercase or lowercase); everything else is mapped to 1. We use this function to construct numeric columns containing property/crop damage directly in USD.

Event types in the original data are recorded sometimes in capital letters, producing some repetition. We convert all-capital event types into lowercase, capitalizing the first letter of every word. Trailing and leading white space is also removed. We have not attempted to fiddle with the data by, e.g., merging together categories like “Tornado” and “Hurricane/Typhoon”. We do have, however, replaced the “tstm” abbreviation with “Thunderstorm”, because it seemed clear to us that it made no sense to, e.g., count “Tstm wind” and “Thunderstorm wind” as two separate event types.

weirdprefs <- c(0:8, "-", "+", "?", "h", "H")

pref2num <- function(prefs) {
    num <- rep(1, length(prefs))
    num[prefs == "k" | prefs == "K"] <- 1e3
    num[prefs == "m" | prefs == "M"] <- 1e6
    num[prefs == "b" | prefs == "B"] <- 1e9
    num
}

## capwords() function from the R help page for
## Character Translation and Casefolding (see ?toupper)

capwords <- function(s, strict = FALSE) {
    cap <- function(s) {
        paste(toupper(substring(s, 1, 1)),
              {s <- substring(s, 2); if(strict) tolower(s) else s},
              sep = "", collapse = " " )
    }
    sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

storms <- storms %>%
    ## Select only those columns we're interested in
    select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%

    ## Eliminate observations with weird prefixes
    filter(! PROPDMGEXP %in% weirdprefs) %>%
    filter(! CROPDMGEXP %in% weirdprefs) %>%

    ## Build numeric columns for crop/property damage
    mutate(property = PROPDMG * pref2num(as.character(PROPDMGEXP))) %>%
    mutate(crop = CROPDMG * pref2num(as.character(CROPDMGEXP))) %>%

    ## Delete trailing and leading spaces in EVTYPE
    mutate(EVTYPE = str_trim(as.character(EVTYPE))) %>%

    ## Convert evtype to lowercase, capitalizing every word
    mutate(EVTYPE = capwords(EVTYPE, strict = TRUE)) %>%

    ## Find and replace some abbreviations used in evtype
    mutate(EVTYPE = sub("Tstm", "Thunderstorm", EVTYPE, ignore.case = TRUE)) %>%

    ## Rename columns to lowercase
    rename(evtype = EVTYPE, fatalities = FATALITIES, injuries = INJURIES) %>%

    ## Eliminate superseded columns
    select(-PROPDMG, -PROPDMGEXP, -CROPDMG, -CROPDMGEXP)

str(storms)
## 'data.frame':    901949 obs. of  5 variables:
##  $ evtype    : chr  "Tornado" "Tornado" "Tornado" "Tornado" ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ property  : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ crop      : num  0 0 0 0 0 0 0 0 0 0 ...

There are 877 unique weather events (“storms”). The next step is to group the data by event type, and compute the sum of all other variables for each event type.

storms <- storms %>%
    group_by(evtype) %>%
    summarise_each(funs(sum))

str(storms)
## Classes 'tbl_df', 'tbl' and 'data.frame':    877 obs. of  5 variables:
##  $ evtype    : chr  "?" "Abnormal Warmth" "Abnormally Dry" "Abnormally Wet" ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 0 1 ...
##  $ injuries  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ property  : num  5000 0 0 0 0 ...
##  $ crop      : num  0 0 0 0 0 ...
##  - attr(*, "drop")= logi TRUE
summary(storms)
##     evtype            fatalities         injuries      
##  Length:877         Min.   :   0.00   Min.   :    0.0  
##  Class :character   1st Qu.:   0.00   1st Qu.:    0.0  
##  Mode  :character   Median :   0.00   Median :    0.0  
##                     Mean   :  17.26   Mean   :  160.2  
##                     3rd Qu.:   0.00   3rd Qu.:    0.0  
##                     Max.   :5630.00   Max.   :91321.0  
##     property              crop          
##  Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:0.000e+00   1st Qu.:0.000e+00  
##  Median :0.000e+00   Median :0.000e+00  
##  Mean   :4.872e+08   Mean   :5.590e+07  
##  3rd Qu.:8.000e+04   3rd Qu.:0.000e+00  
##  Max.   :1.447e+11   Max.   :1.397e+10

From the summary of the data we learn that at least 75 percent of all “storms” produce no fatalities, no injuries, and no crop damage, and that at least 50 percent of all “storms” also produce no property damage.

Total numbers of fatalities, injuries, and crop/property damage are:

total_fatalities <- sum(storms$fatalities)
total_injuries <- sum(storms$injuries)
total_property <- sum(storms$property)
total_crop <- sum(storms$crop)

This gives 1.51 × 104 fatalities, 1.4 × 105 injuries, 4.27 × 1011 USD in property losses, and 4.9 × 1010 USD in crop damage.

Since we're only interested in those events that produce the most damage, we ruthlessly supress all observations where fatalities, injuries, and crop/property damage are all equal to zero.

storms <- storms %>%
    filter(fatalities != 0 | injuries != 0 | property != 0 | crop != 0)
summary(storms)
##     evtype            fatalities         injuries      
##  Length:432         Min.   :   0.00   Min.   :    0.0  
##  Class :character   1st Qu.:   0.00   1st Qu.:    0.0  
##  Mode  :character   Median :   0.00   Median :    0.0  
##                     Mean   :  35.03   Mean   :  325.2  
##                     3rd Qu.:   2.00   3rd Qu.:    2.0  
##                     Max.   :5630.00   Max.   :91321.0  
##     property              crop          
##  Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:5.000e+03   1st Qu.:0.000e+00  
##  Median :8.750e+04   Median :0.000e+00  
##  Mean   :9.892e+08   Mean   :1.135e+08  
##  3rd Qu.:1.860e+06   3rd Qu.:1.000e+04  
##  Max.   :1.447e+11   Max.   :1.397e+10

We are finally left with 432 observations where at least one damage indicator is different from zero.

Results

Fatalities and Injuries

To assess which event types are more harmful to humans (meaning they cause more fatalities and injuries), we arrange the data according to fatalities and injuries (in descending order), which gives us a list of the worst events:

topfatalities <- arrange(storms, desc(fatalities))
topinjuries <- arrange(storms, desc(injuries))
print(topfatalities)
## Source: local data frame [432 x 5]
## 
##               evtype fatalities injuries     property       crop
## 1            Tornado       5630    91321  56936985483  364950110
## 2     Excessive Heat       1903     6525      7753700  492402000
## 3        Flash Flood        978     1777  16140861717 1420727100
## 4               Heat        937     2100      1797000  401461500
## 5          Lightning        816     5230    928659283   12092090
## 6  Thunderstorm Wind        637     8445   7976179582  968850400
## 7              Flood        470     6789 144657709807 5661968450
## 8        Rip Current        368      232         1000          0
## 9          High Wind        246     1137   5270046260  638571300
## 10         Avalanche        224      170      3721800          0
## ..               ...        ...      ...          ...        ...
print(topinjuries)
## Source: local data frame [432 x 5]
## 
##               evtype fatalities injuries     property       crop
## 1            Tornado       5630    91321  56936985483  364950110
## 2  Thunderstorm Wind        637     8445   7976179582  968850400
## 3              Flood        470     6789 144657709807 5661968450
## 4     Excessive Heat       1903     6525      7753700  492402000
## 5          Lightning        816     5230    928659283   12092090
## 6               Heat        937     2100      1797000  401461500
## 7          Ice Storm         89     1975   3944927810 5022110000
## 8        Flash Flood        978     1777  16140861717 1420727100
## 9               Hail         15     1358  15732261777 3000954453
## 10      Winter Storm        206     1321   6688497250   26944000
## ..               ...        ...      ...          ...        ...

By a large margin, tornados turn out to be the events that cause more fatalities and injuries. Tornados cause almost 3 times more fatalities than the second most fatal event, excessive heat. Tornados also cause about 10 times more injuries than the second most dangerous event, thunderstorm wind.

To represent the worst events in a plot, we first decide on a short list that includes the five more damaging events in terms of both fatalities and injuries:

tophumanevs <- unique(c(topfatalities$evtype[1:5], topinjuries$evtype[1:5]))
print(tophumanevs)
## [1] "Tornado"           "Excessive Heat"    "Flash Flood"      
## [4] "Heat"              "Lightning"         "Thunderstorm Wind"
## [7] "Flood"

Now we recast the data in a format that's more suitable for plotting:

tophuman <- storms %>%
    select(-property, -crop) %>%
    filter(evtype %in% tophumanevs) %>%
    mutate(fatalities = 100*fatalities/total_fatalities) %>%
    mutate(injuries = 100*injuries/total_injuries) %>%
    gather(Damage, Percentage, -evtype) %>%
    rename(Event = evtype) %>%
    mutate(Event = factor(Event), Damage = factor(Damage)) %>%
    arrange(desc(Percentage))
tophuman
## Source: local data frame [14 x 3]
## 
##                Event     Damage Percentage
## 1            Tornado   injuries  65.010109
## 2            Tornado fatalities  37.198546
## 3     Excessive Heat fatalities  12.573505
## 4        Flash Flood fatalities   6.461843
## 5               Heat fatalities   6.190948
## 6  Thunderstorm Wind   injuries   6.011874
## 7          Lightning fatalities   5.391477
## 8              Flood   injuries   4.832992
## 9     Excessive Heat   injuries   4.645054
## 10 Thunderstorm Wind fatalities   4.208788
## 11         Lightning   injuries   3.723162
## 12             Flood fatalities   3.105385
## 13              Heat   injuries   1.494960
## 14       Flash Flood   injuries   1.265021
str(tophuman)
## Classes 'tbl_df', 'tbl' and 'data.frame':    14 obs. of  3 variables:
##  $ Event     : Factor w/ 7 levels "Excessive Heat",..: 7 7 1 2 4 6 5 3 1 6 ...
##  $ Damage    : Factor w/ 2 levels "fatalities","injuries": 2 1 1 1 1 2 1 2 2 1 ...
##  $ Percentage: num  65.01 37.2 12.57 6.46 6.19 ...

The following stacked bar plot shows the fraction (percentage) of fatalities and injuries caused by the seven deadliest events. Note that these account for over 75 percent of all human damage.

## Change plot order of Events
tophuman$Event <- factor(tophuman$Event,
                  c("Tornado", "Excessive Heat", "Flash Flood", "Heat",
                    "Thunderstorm Wind", "Lightning", "Flood"))

## Plot
g <- ggplot(data = tophuman, aes(x = Damage, y = Percentage, fill = Event)) +
    geom_bar(stat = "identity") +
    labs(title = "Percentage of fatalities and injuries\ncaused by most catastrofic weather events") +
    labs(x = "", y = "") +
    ylim(0, 100) +
    scale_fill_manual(values = brewer.pal(length(tophumanevs), "Set1"))
print(g)

plot of chunk unnamed-chunk-11

Crop and Property Damage

To assess which event types are more harmful to the economy (meaning they cause large crop and property damage), we arrange the data according to property and crop (in descending order), which gives us a list of the worst events:

topproperty <- arrange(storms, desc(property))
topcrop <- arrange(storms, desc(crop))
print(topproperty)
## Source: local data frame [432 x 5]
## 
##               evtype fatalities injuries     property       crop
## 1              Flood        470     6789 144657709807 5661968450
## 2  Hurricane/typhoon         64     1275  69305840000 2607872800
## 3            Tornado       5630    91321  56936985483  364950110
## 4        Storm Surge         13       38  43323536000       5000
## 5        Flash Flood        978     1777  16140861717 1420727100
## 6               Hail         15     1358  15732261777 3000954453
## 7          Hurricane         61       46  11868319010 2741910000
## 8  Thunderstorm Wind        637     8445   7976179582  968850400
## 9     Tropical Storm         58      340   7703890550  678346000
## 10      Winter Storm        206     1321   6688497250   26944000
## ..               ...        ...      ...          ...        ...
print(topcrop)
## Source: local data frame [432 x 5]
## 
##               evtype fatalities injuries     property        crop
## 1            Drought          0        4   1046106000 13972566000
## 2              Flood        470     6789 144657709807  5661968450
## 3        River Flood          2        2   5118945500  5029459000
## 4          Ice Storm         89     1975   3944927810  5022110000
## 5               Hail         15     1358  15732261777  3000954453
## 6          Hurricane         61       46  11868319010  2741910000
## 7  Hurricane/typhoon         64     1275  69305840000  2607872800
## 8        Flash Flood        978     1777  16140861717  1420727100
## 9       Extreme Cold        162      231     67737400  1312973000
## 10      Frost/freeze          0        0     10480000  1094186000
## ..               ...        ...      ...          ...         ...

Interestingly, Floods and Droughts turn out to be the events that cause greater economic damage. Floods cause twice as much property damage as the second worst event, Hurricane/typhoon. In crop damage, however, Droughts cause 30 percent more losses than Floods and River Floods, the second and third worst crop killers, taken together.

To represent the worst events in a plot, we first decide on a short list that includes the five more damaging events in terms of both crop and economy damage:

topeconomyevs <- unique(c(topcrop$evtype[1:5], topproperty$evtype[1:5]))
print(topeconomyevs)
## [1] "Drought"           "Flood"             "River Flood"      
## [4] "Ice Storm"         "Hail"              "Hurricane/typhoon"
## [7] "Tornado"           "Storm Surge"       "Flash Flood"

Now we recast the data in a format that's more suitable for plotting:

topeconomy <- storms %>%
    select(-fatalities, -injuries) %>%
    filter(evtype %in% topeconomyevs) %>%
    mutate(property = 100*property/total_property) %>%
    mutate(crop = 100*crop/total_crop) %>%
    gather(Damage, Percentage, -evtype) %>%
    rename(Event = evtype) %>%
    mutate(Event = factor(Event), Damage = factor(Damage)) %>%
    arrange(desc(Percentage))
topeconomy
## Source: local data frame [18 x 3]
## 
##                Event   Damage   Percentage
## 1              Flood property 3.385252e+01
## 2            Drought     crop 2.849881e+01
## 3  Hurricane/typhoon property 1.621882e+01
## 4            Tornado property 1.332429e+01
## 5              Flood     crop 1.154830e+01
## 6        River Flood     crop 1.025822e+01
## 7          Ice Storm     crop 1.024323e+01
## 8        Storm Surge property 1.013849e+01
## 9               Hail     crop 6.120825e+00
## 10 Hurricane/typhoon     crop 5.319086e+00
## 11       Flash Flood property 3.777254e+00
## 12              Hail property 3.681634e+00
## 13       Flash Flood     crop 2.897752e+00
## 14       River Flood property 1.197926e+00
## 15         Ice Storm property 9.231845e-01
## 16           Tornado     crop 7.443618e-01
## 17           Drought property 2.448077e-01
## 18       Storm Surge     crop 1.019813e-05
str(topeconomy)
## Classes 'tbl_df', 'tbl' and 'data.frame':    18 obs. of  3 variables:
##  $ Event     : Factor w/ 9 levels "Drought","Flash Flood",..: 3 1 5 9 3 7 6 8 4 5 ...
##  $ Damage    : Factor w/ 2 levels "property","crop": 1 2 1 1 2 2 2 1 2 2 ...
##  $ Percentage: num  33.9 28.5 16.2 13.3 11.5 ...

The following stacked bar plot shows the fraction (percentage) of crop and property damage caused by the nine severest events. Note that these account for over 75 percent of all economic damage.

## Change plot order of Events
topeconomy$Event <- factor(topeconomy$Event,
                    c("Flood", "Drought", "Hurricane/typhoon", "Tornado",
                      "River Flood", "Ice Storm", "Storm Surge", "Hail",
                      "Flash Flood"))

## Plot
h <- ggplot(data = topeconomy, aes(x = Damage, y = Percentage, fill = Event)) +
    geom_bar(stat = "identity") +
    labs(title = "Percentage of property and crop damages\ncaused by most catastrofic weather events") +
    labs(x = "", y = "") +
    ylim(0, 100) +
    scale_fill_manual(values = brewer.pal(length(topeconomyevs), "Set1"))
print(h)

plot of chunk unnamed-chunk-15