This report estimates the average health harm and economic damage caused by different extreme weather events in the USA. The data set used for the investigation is the US National Oceanic and Atmospheric Administration's storm database (see the related storm data documentation).
The most harmful weather event for human health is the F5 tornado (F5 denotes the tornado category according to the Fujita scale (F-Scale)). An average F5 tornado causes about one hundred injuries, of which 9% are fatal. The biggest economic consequences are caused by hurricanes: an average hurricane causes over 400 million dollars of economic damage. The F5 tornado takes second place here, with over 130 million dollars of economic damage. All economic impact values are expressed in 2016 US dollar buying-power equivalents.
The storm database mentioned above is used for the analysis (last successful download: Feb 20th, 2017, 9:20 EET). Several steps are performed to preprocess and analyse the data set:
The storm database is a large .csv file compressed into a .bz2 archive. It contains 37 variables and nearly a million records. The whole data set loaded into R takes over 400 MB of RAM.
There is no need to load all variables into memory. The ones needed are:
BGN_DATE: The date when a given weather event (EVTYPE variable) began. The year of each BGN_DATE record is used to select the annual Consumer Price Index (CPI) needed to convert the economic damage of that year into its 2016 equivalent.
EVTYPE: Weather event type.
F: Tornado category according to the Fujita scale.
FATALITIES: Deaths caused by a given EVTYPE.
INJURIES: Injuries caused by a given EVTYPE.
PROPDMG: Property damage caused by a given EVTYPE. The actual value depends on the PROPDMGEXP value.
PROPDMGEXP: Determines the PROPDMG absolute value (in most cases), i.e.
PropertyDamage = PROPDMG * 10^PROPDMGEXP.
CROPDMG: Crop damage caused by a given EVTYPE. The actual value depends on the CROPDMGEXP value.
CROPDMGEXP: Determines the CROPDMG absolute value (in most cases), i.e.
CropDamage = CROPDMG * 10^CROPDMGEXP.
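For illustration, two hypothetical records decoded with this formula (the "K" code stands for thousand and is handled in the processing step further down):
## Hypothetical values, used only to illustrate the encoding above
2.5 * 10^3 # PROPDMG = 2.5, PROPDMGEXP = "3" ->  2,500 dollars
25 * 10^3  # PROPDMG = 25,  PROPDMGEXP = "K" -> 25,000 dollars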
storm_data <-
read.table(
file = "repdata_data_StormData.csv.bz2",
header = TRUE,
sep = ",",
na.strings = NULL,
colClasses = c(
"NULL", "character", rep("NULL", 5), "factor", rep("NULL", 12),
"factor", "NULL", rep("numeric", 3), "factor", "numeric", "factor",
rep("NULL", 9)
)
)
str(storm_data)
## 'data.frame': 902297 obs. of 9 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ F : Factor w/ 7 levels "","0","1","2",..: 5 4 4 4 4 4 4 3 5 5 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
There are 48 official event types according to section 2.1.1 Storm Data Event Table of the Storm Data Preparation document linked at the beginning (last successful download: Feb 20th, 2017, 9:20 EET). However, the storm_data data frame contains
library(dplyr)
unique(storm_data$EVTYPE) %>% length()
## [1] 985
unique event types, which are essentially:
typos;
official event types with additional text added;
several (two or three) official event types recorded in a single row;
event types that are not listed among the official event types.
unique(storm_data$EVTYPE) %>% head(20)
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## 985 Levels: ? ABNORMALLY DRY ABNORMALLY WET ... WND
This report analyses only the official event types.
## Create vector which lists all official event types
official_events <-
factor(
c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
"Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
"Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
"Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain",
"Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)",
"Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
"Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
)
The task of the Stage 2 section is to identify as many official event types as possible. Here is the vector of all unique event types (called unique_events, as opposed to the official_events vector):
## The vector will be reduced each time some of its events are filtered out,
## step by step. So at the end of stage 2 the vector will be shorter than at
## the beginning.
unique_events <- unique(storm_data$EVTYPE)
First of all, some official event types may match some of the unique event types (almost) exactly:
matched_events <-
sapply(
## Every official event is surrounded by regex (i.e. forms a regex)
paste(
## Unique event name may begin with non character symbols
"^[^a-z]*",
official_events,
## After unique event name there may be (in the presented order):
## - Word ending 's', 'es' or 'ing'.
## - Any quantity of non character symbols: spaces in most cases,
## but there also may be '-', '/' etc.
## - Any single letter: some unique events are marked with a single
## letter.
## - Any quantity of non character symbols: in most cases these are
## numbers.
## - "mph": this concerns to thunderstorm wind event.
## - Any quantity of non character symbols: in most cases this is ')'.
"(s|es|ing)?[^a-z]*[a-z]?[^a-z]*(mph)?[^a-z]*$",
sep = ""
),
FUN = grepl,
unique_events,
ignore.case = TRUE
)
dim(matched_events)
## [1] 985 48
The last code snippet produced a logical matrix where each row represents a single unique event type and each column a single official event type. TRUE values in this matrix mean that a given unique event type (row) matched a given official event type (column).
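As a quick sanity check (a small sketch using the objects created above), we can look up which official event type the first unique event type matched; it should flag only Tornado:
## Which official event type(s) does the first unique event type match?
as.character(unique_events[1])
as.character(official_events[matched_events[1, ]])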
Let's see if there are any matches:
## Iterate through rows (unique event types) to find matching with official
## event type
apply(matched_events, MARGIN = 1, FUN = function(matrix_row) any(matrix_row)) %>%
## Retrieve only TRUEs
.[.] %>%
length()
## [1] 160
Now the matched values need to be retrieved. Let's create a data frame which will contain the unique event types that matched official ones:
## Data frame which contains sorted unique event types related to
## official event types
sorted_events <-
data.frame(
official_event = factor(levels = official_events),
unique_event = factor(levels = levels(storm_data$EVTYPE)),
## A lot of unique event types consist of several official event types
## (e.g. unique event 'ICE STORM/FLASH FLOOD' consists of two
## official event types: Ice Storm and Flash Flood).
## The variable declared describes how many official events are included
## in a given unique_event. This value will serve as damage and injuries
## divider to split these values among official event types
events_involved = integer(),
stringsAsFactors = FALSE
)
Let's fill sorted_events with the matched values:
library(magrittr)
## Iterate through 'matched_events' matrix's rows to fill in 'sorted_events'
## data frame with exactly matched unique event names
for (i in 1:nrow(matched_events))
{
## Index of 'official_events' vector which matched with record from
## 'unique_events'
occurence_index <- match(TRUE, table = matched_events[i, ])
## If there is TRUE value in a given matrix's row
if (!is.na(occurence_index))
{
sorted_events %<>%
summarise(
official_event = official_events[occurence_index],
unique_event = unique_events[i],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
}
}
## What does the 'sorted_events' data frame look like?
head(sorted_events)
## official_event unique_event events_involved
## 1 Tornado TORNADO 1
## 2 Hail HAIL 1
## 3 Winter Storm WINTER STORM 1
## 4 Thunderstorm Wind THUNDERSTORM WINDS 1
## 5 Heavy Rain HEAVY RAIN 1
## 6 Lightning LIGHTNING 1
## Remove unique events that are already sorted out to 'sorted_events'
## data frame
unique_events <-
unique_events[
!apply(
matched_events,
MARGIN = 1,
FUN = function(matrix_row) any(matrix_row)
)
]
## How many unsorted unique events remain?
length(unique_events)
## [1] 825
As mentioned above, there are unique events which contain several (two or three) official events recorded in a single row. Let's find the unique event types which contain two events at once; a minimal illustration of the pattern is shown below, followed by the full search:
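Here the unique event type ICE STORM/FLASH FLOOD (known to be present in the data) is checked against a hand-built instance of the two-event pattern assembled in the loop below:
## Does 'ICE STORM/FLASH FLOOD' match the Ice Storm + Flash Flood pattern?
grepl(
  "^[^a-z]*Ice Storm(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*Flash Flood(s|es|ing)?[^a-z]*$",
  "ICE STORM/FLASH FLOOD",
  ignore.case = TRUE
)
## [1] TRUE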
for (i in seq_along(official_events))
{
matched_events <-
sapply(
## Two official event types are surrounded by regex
paste(
## Unique event type may begin with any non character signs
## quantity
"^[^a-z]*",
official_events[i],
## - First official event type may end with 's', 'es' or 'ing'.
## - There must be '/', '&' or 'and' between two
## official event types.
## - There also may be any quantity of non character signs.
"(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*",
official_events,
## - Second official event type may end with 's', 'es' or 'ing'.
## - Unique event type may end with any non character signs
## quantity.
"(s|es|ing)?[^a-z]*$",
sep = ""
),
FUN = grepl,
unique_events,
ignore.case = TRUE
)
## 'unique_events' vector indices which have matched during a given
## loop iteration.
occurence_indices <- integer()
## Find matched patterns among 'unique_events' vector
for (matrix_row in 1:nrow(matched_events))
{
occurence_index <- match(TRUE, table = matched_events[matrix_row, ])
## If there is an index present in a given vector element
if (!is.na(occurence_index))
{
## If same official event type is written twice in a given
## unique event type
if (occurence_index == i)
{
warning("Following unique event type consists of the same",
" official event type, ", official_events[i],
", written twice:\n", unique_events[matrix_row])
next
}
occurence_indices[length(occurence_indices) + 1] <- matrix_row
## Add both official event types involved in unique event type
sorted_events %<>%
summarise(
official_event = official_events[i],
unique_event = unique_events[matrix_row],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[occurence_index],
unique_event = unique_events[matrix_row],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched elements that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
}
## How many sorted events now? This time each matched unique event adds
## two records to the 'sorted_events' data frame. That's because we've found
## unique events that contain two official events in a single record
nrow(sorted_events)
## [1] 274
## What do the newly added rows look like?
tail(sorted_events)
## official_event unique_event events_involved
## 269 Waterspout WATERSPOUT/ TORNADO 2
## 270 Tornado WATERSPOUT/ TORNADO 2
## 271 Winter Storm WINTER STORM/HIGH WIND 2
## 272 High Wind WINTER STORM/HIGH WIND 2
## 273 Winter Storm WINTER STORM/HIGH WINDS 2
## 274 High Wind WINTER STORM/HIGH WINDS 2
## How many unsorted unique events remain?
length(unique_events)
## [1] 768
In the previous step we searched for unique event types which contain two events at once. In this step we try to find three official events in a single unique event:
for (event1 in seq_along(official_events))
{
for (event2 in seq_along(official_events))
{
matched_events <-
sapply(
## Three official event types are surrounded by regex
paste(
## Unique event type may begin with any non character signs
## quantity
"^[^a-z]*",
official_events[event1],
## - First official event type may end with 's', 'es' or
## 'ing'.
## - There must be '/', '&' or 'and' between first and
## second official event type.
## - There also may be any quantity of non character signs.
"(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*",
official_events[event2],
## - Second official event type may end with 's', 'es' or
## 'ing'.
## - There must be '/', '&' or 'and' between second and
## third official event type.
## - There also may be any quantity of non character signs.
"(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*",
official_events,
## - Third official event type may end with 's', 'es' or
## 'ing'.
## - Unique event type may end with any non character signs
## quantity.
"(s|es|ing)?[^a-z]*$",
sep = ""
),
FUN = grepl,
unique_events,
ignore.case = TRUE
)
## 'unique_events' vector indices which have matched during a given
## loop iteration.
occurence_indices <- integer()
## Find matched patterns among 'unique_events' vector
for (matrix_row in 1:nrow(matched_events))
{
occurence_index <- match(TRUE, table = matched_events[matrix_row, ])
## If there is an index present in a given vector element
if (!is.na(occurence_index))
{
## If same official event type is written twice or three times
## in a given unique event type
if (occurence_index == event1 ||
occurence_index == event2 ||
event1 == event2)
{
warning("Following unique event type consists of the same",
" official event types written twice or three times:\n",
unique_events[matrix_row])
next
}
occurence_indices[length(occurence_indices) + 1] <- matrix_row
## Add all three official event types involved in
## unique event type
sorted_events %<>%
summarise(
official_event = official_events[event1],
unique_event = unique_events[matrix_row],
events_involved = 3
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[event2],
unique_event = unique_events[matrix_row],
events_involved = 3
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[occurence_index],
unique_event = unique_events[matrix_row],
events_involved = 3
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched elements that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
}
}
## How many sorted events now? This time each matched unique event adds
## three records to the 'sorted_events' data frame. That's because we've found
## unique events that contain three official events in a single record
nrow(sorted_events)
## [1] 280
## What do the newly added rows look like?
tail(sorted_events)
## official_event unique_event events_involved
## 275 Heavy Snow HEAVY SNOW/BLIZZARD/AVALANCHE 3
## 276 Blizzard HEAVY SNOW/BLIZZARD/AVALANCHE 3
## 277 Avalanche HEAVY SNOW/BLIZZARD/AVALANCHE 3
## 278 Heavy Snow HEAVY SNOW/HIGH WINDS & FLOOD 3
## 279 High Wind HEAVY SNOW/HIGH WINDS & FLOOD 3
## 280 Flood HEAVY SNOW/HIGH WINDS & FLOOD 3
## How many unsorted unique events remain?
length(unique_events)
## [1] 766
During this step we'll investigate the official events and find unique words among them, i.e. we'll search for words that occur only once in the whole official_events vector. These words will then be searched for in the unique_events elements. Additionally, we'll mark the words that are duplicated within the official_events vector. So the unique events we are after should contain exactly one unique word and no duplicated words; a toy illustration of the idea follows.
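Here is the toy illustration on three official event names (a demonstration only, not part of the pipeline):
## Split a few official event names into words and count the occurrences
table(unlist(strsplit(tolower(c("Coastal Flood", "Flash Flood", "Dense Fog")), "[^a-z]+")))
## 'coastal', 'dense', 'flash' and 'fog' occur once, so each identifies its
## official event on its own; 'flood' occurs twice and cannot.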
Looking ahead, there will be three false positives: Icestorm/Blizzard, EXTREMELY WET and RECORD LOW RAINFALL. In order not to mess up sorted_events, let's handle them beforehand:
## At this cleaning step there is searching for unique words from official
## event types performed. "Icestorm/Blizzard" unique event type is found during
## mentioned search. But it's not appropriate to be classified as single
## official event type because it contains two events, "Ice Storm" and "Blizzard".
## That's why it needs to be classified manually before performing mentioned
## cleaning stage
sorted_events %<>%
summarise(
official_event = official_events[official_events == "Ice Storm"],
unique_event = unique_events[unique_events == "Icestorm/Blizzard"],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[official_events == "Blizzard"],
unique_event = unique_events[unique_events == "Icestorm/Blizzard"],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
## The 'RECORD LOW RAINFALL' unique event would be classified as the
## 'Heavy Rain' official event. Actually it's the opposite - 'Drought' :)
sorted_events %<>%
summarise(
official_event = official_events[official_events == "Drought"],
unique_event = unique_events[unique_events == "RECORD LOW RAINFALL"],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
unique_events <-
unique_events[
## Also there is "EXTREMELY WET" unique event type classified as
## "Extreme Cold/Wind Chill" official event type. Nevertheless there is
## no official event type which can be classified as "EXTREMELY WET",
## hence it should be ignored
unique_events != "EXTREMELY WET" &
unique_events != "Icestorm/Blizzard" &
unique_events != "RECORD LOW RAINFALL"
]
Now let's perform the search:
## All words contained in 'official_events' vector (duplicates included)
official_events_all_words <-
strsplit(
as.character(official_events) %>% tolower(),
## Split by noncharacter symbols
split = "[^a-zA-Z]"
) %>%
unlist() %>%
## There will be an empty string produced which is not needed
.[. != ""]
head(official_events_all_words)
## [1] "astronomical" "low" "tide" "avalanche"
## [5] "blizzard" "coastal"
nonunique_words <-
official_events_all_words[duplicated(official_events_all_words)]
## Unique words that have no duplicates (i.e. have single occurrence in
## 'official_events_all_words')
official_events_unique_words <-
official_events_all_words[!(official_events_all_words %in% nonunique_words)] %>%
## - Unique word "freezing" is inappropriate in the context of
## official events unique words. Mentioned word implies
## "Freezing Fog" collocation, but such word combination is already
## filtered when searching for exactly matching patterns between
## official event types and unique event types.
## - Unique words "ice" and "weather" imply "Ice Storm" and "Winter Weather"
## official event types accordingly. But mentioned unique words are used
## in unique event types to denote not only "Ice Storm" and "Winter Weather"
## events, hence they cannot be used for unique words searching.
## - Unique word "excessive" implies "Excessive Heat" official event type.
## But this word is used as a common adjective in unique event types.
## Note. There is also "extreme" adjective which implies
## "Extreme Cold/Wind Chill" official event type. But it seems that when
## searching in unique event types for single unique word occurrence AND
## nonunique words absence, "extreme" is occurs only for its
## official event type
.[. != "freezing" & . != "ice" & . != "weather" & . != "excessive"]
head(official_events_unique_words)
## [1] "astronomical" "low" "avalanche" "blizzard"
## [5] "coastal" "debris"
## Nonunique words that have duplicates (i.e. have multiple occurrences in
## 'official_events_all_words')
official_events_nonunique_words <-
official_events_all_words[official_events_all_words %in% nonunique_words] %>%
unique()
head(official_events_nonunique_words)
## [1] "tide" "flood" "cold" "wind" "chill" "dense"
## Shows which unique words belong to which official event types
unique_words_correspondence_table <-
data.frame(
unique_word = official_events_unique_words,
official_event_type =
factor(
character(length = length(official_events_unique_words)),
levels = levels(official_events)
)
)
## The matrix which shows at which index of 'official_events' vector
## a given unique word occurs. Columns represent unique words from
## official events while rows represent indices from 'official_events'.
unique_words_in_official_events <-
sapply(
paste("\\b", official_events_unique_words, "\\b", sep = ""),
FUN = grepl,
official_events,
ignore.case = TRUE
)
## Fill 'unique_words_correspondence_table' data frame's
## 'official_event_type' variable
for (i in 1:ncol(unique_words_in_official_events))
{
## Index of element corresponding to unique word
official_event_type_index <-
match(TRUE, unique_words_in_official_events[, i])
unique_words_correspondence_table[i, ]$official_event_type <-
official_events[official_event_type_index]
}
## Retain only one unique word for a given official event type
## (i.e. official event type can have only one unique word as a representative)
unique_words_correspondence_table <-
filter(
.data = unique_words_correspondence_table,
!duplicated(official_event_type)
)
## What does the created data frame look like?
head(unique_words_correspondence_table)
## unique_word official_event_type
## 1 astronomical Astronomical Low Tide
## 2 avalanche Avalanche
## 3 blizzard Blizzard
## 4 coastal Coastal Flood
## 5 debris Debris Flow
## 6 smoke Dense Smoke
nrow(unique_words_correspondence_table)
## [1] 28
## The matrix which shows where official events unique words occur in
## all unique events vector. Columns represent each unique word while rows
## represent each of total unique events
unique_words_in_unique_events <-
sapply(
## Here unique words are also matched as parts of longer words,
## not only as distinct words
unique_words_correspondence_table$unique_word,
FUN = grepl,
unique_events,
ignore.case = TRUE
)
dim(unique_words_in_unique_events)
## [1] 763 28
## Logical vector describing whether a given 'unique_events' element contains
## nonunique word from official event types vector or not
nonunique_word_in_unique_event <-
sapply(
paste("\\b", official_events_nonunique_words, "\\b", sep = ""),
FUN = grepl,
unique_events,
ignore.case = TRUE
) %>%
apply(
MARGIN = 1,
FUN = function(matrix_row) any(matrix_row[matrix_row])
)
length(nonunique_word_in_unique_event)
## [1] 763
## 'unique_events' vector indices which are perceived as matched during
## official event types unique words searching
occurence_indices <- integer()
## Find unique event types which contain single unique word AND do not contain
## nonunique words from official event types
for (i in seq_along(unique_events))
{
unique_words_quantity <-
unique_words_in_unique_events[i, ][unique_words_in_unique_events[i, ]] %>%
length()
## If there is single official event types unique word present in
## a unique event type AND there are no official event types word duplicates
## present
if (unique_words_quantity == 1 & !nonunique_word_in_unique_event[i])
{
occurence_indices[length(occurence_indices) + 1] <- i
unique_word_index <- match(TRUE, unique_words_in_unique_events[i, ])
sorted_events %<>%
summarise(
official_event =
unique_words_correspondence_table[unique_word_index, ]$official_event_type,
unique_event = unique_events[i],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched unique event types that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
## How many sorted events now? This time each matched unique event adds
## single record to 'sorted_events' data frame.
nrow(sorted_events)
## [1] 342
## What do the newly added rows look like?
tail(sorted_events)
## official_event unique_event events_involved
## 337 Extreme Cold/Wind Chill EXTREME WINDCHILL TEMPERATURES 1
## 338 Heavy Rain LIGHT FREEZING RAIN 1
## 339 Dense Smoke SMOKE 1
## 340 High Surf HAZARDOUS SURF 1
## 341 Hurricane (Typhoon) HURRICANE/TYPHOON 1
## 342 Volcanic Ash VOLCANIC ASHFALL 1
## How many unsorted unique events remain?
length(unique_events)
## [1] 704
Finally, let's find matches between official and unique event types using the amatch() function from the stringdist package:
library(stringdist)
## Bigger 'maxDist' values will give false positives
matched_events <- amatch(unique_events, table = official_events, maxDist = 2)
## 'unique_events' vector indices which matched during the approximate
## string matching with official event types
occurence_indices <- integer()
for (i in seq_along(matched_events))
{
## If unique event type has matched with official event type
if (!is.na(matched_events[i]))
{
occurence_indices[length(occurence_indices) + 1] <- i
sorted_events %<>%
summarise(
official_event = official_events[matched_events[i]],
unique_event = unique_events[i],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched unique event types that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
## How many sorted events now?
nrow(sorted_events)
## [1] 343
## What does the newly added row look like?
tail(sorted_events, n = 1)
## official_event unique_event events_involved
## 343 Lake-Effect Snow Lake Effect Snow 1
## How many unsorted unique events remain?
length(unique_events)
## [1] 703
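For reference, this is how amatch() treats the single match found above (a standalone illustration): Lake Effect Snow lies within edit distance 2 of the official Lake-Effect Snow, so it is matched, while anything farther away would return NA.
## 'Lake Effect Snow' differs from 'Lake-Effect Snow' by one character
amatch("Lake Effect Snow", table = "Lake-Effect Snow", maxDist = 2)
## [1] 1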
Cleaning summary.
The searches performed above helped us classify 29%
((unique(sorted_events$unique_event) %>% length()) /
(unique(storm_data$EVTYPE) %>% length())) %>%
round(digits = 2)
## [1] 0.29
of the unique event types as official event types. Nevertheless, this share covers 73%
((storm_data$EVTYPE %in% unique(sorted_events$unique_event) %>% .[.] %>% length()) /
nrow(storm_data)) %>%
round(digits = 2)
## [1] 0.73
of the storm data records.
## Apply obtained results to 'storm_data' data frame: retain only records that
## have determined events
storm_data %<>% filter(EVTYPE %in% unique(sorted_events$unique_event))
nrow(storm_data)
## [1] 661472
At this stage the cleaned data needs to be evaluated and distributed among the official event types. As stated in the synopsis, economic impact values are expressed in 2016 US dollar buying-power equivalents. The issue with the damage values in the storm database is that they reflect US dollar buying power in the year they were recorded, and things that cost x in 1980 cost, on average, x * 2.95 in 2017. Thus we need coefficients which convert values from all the different years to a single year's equivalent. Consumer Price Indexes (CPI) help to solve this issue; a small sketch of the idea is shown below, and the actual CPI data are downloaded right after it:
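A minimal sketch of the conversion idea, using illustrative CPI values only (approximate annual averages, not the data downloaded below):
## Illustrative annual average CPIs (approximate values)
cpi_1980 <- 82.4
cpi_2016 <- 240.0
## 1,000,000 1980 dollars expressed in 2016 buying power
1e6 * cpi_2016 / cpi_1980
## roughly 2.9 million, in line with the factor quoted above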
library(quantmod)
## Download Consumer Price Indexes (CPI) from
## St. Louis Federal Reserve Bank's FRED system. The data downloaded is
## saved as time series 'CPIAUCSL' object.
## Last successful downloading: Feb 20th, 2017, 11:00 EET
getSymbols.FRED(Symbols = "CPIAUCSL", env = .GlobalEnv)
## [1] "CPIAUCSL"
class(CPIAUCSL)
## [1] "xts" "zoo"
The downloaded data is a time series object which represents monthly CPIs beginning from
first(CPIAUCSL)
## CPIAUCSL
## 1947-01-01 21.48
Since the storm data set contains values beginning from
head(storm_data$BGN_DATE, n = 1)
## [1] "4/18/1950 0:00:00"
the downloaded CPIs are appropriate for our purpose. The CPIAUCSL time series contains monthly data:
first(CPIAUCSL, n = 5)
## CPIAUCSL
## 1947-01-01 21.48
## 1947-02-01 21.62
## 1947-03-01 22.00
## 1947-04-01 22.00
## 1947-05-01 21.95
This report focuses on yearly CPI values:
library(lubridate)
## A data frame which contains yearly CPIs.
CPIs <-
apply.yearly(CPIAUCSL, FUN = mean) %>%
data.frame(year(.) %>% as.factor())
colnames(CPIs) <- c("CPI", "year")
rownames(CPIs) <- NULL
head(CPIs)
## CPI year
## 1 22.33167 1947
## 2 24.04500 1948
## 3 23.80917 1949
## 4 24.06250 1950
## 5 25.97333 1951
## 6 26.56667 1952
## Retain only year values for event date instead of whole date
storm_data$BGN_DATE %<>%
mdy_hms() %>%
year() %>%
factor(levels = levels(CPIs$year))
storm_data %<>% rename(year = BGN_DATE)
head(storm_data)
## year EVTYPE F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 1950 TORNADO 3 0 15 25.0 K 0
## 2 1950 TORNADO 2 0 0 2.5 K 0
## 3 1951 TORNADO 2 0 2 25.0 K 0
## 4 1951 TORNADO 2 0 2 2.5 K 0
## 5 1951 TORNADO 2 0 2 2.5 K 0
## 6 1951 TORNADO 2 0 6 2.5 K 0
Now we have all the data sets necessary to calculate and distribute injury and damage values among the official event types:
storm_data contains the raw values distributed among unique event types:
head(storm_data)
## year EVTYPE F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 1950 TORNADO 3 0 15 25.0 K 0
## 2 1950 TORNADO 2 0 0 2.5 K 0
## 3 1951 TORNADO 2 0 2 25.0 K 0
## 4 1951 TORNADO 2 0 2 2.5 K 0
## 5 1951 TORNADO 2 0 2 2.5 K 0
## 6 1951 TORNADO 2 0 6 2.5 K 0
sorted_events is a correspondence table which binds official event types to the related unique event types and records how many official events are contained in a single unique event record:
head(sorted_events)
## official_event unique_event events_involved
## 1 Tornado TORNADO 1
## 2 Hail HAIL 1
## 3 Winter Storm WINTER STORM 1
## 4 Thunderstorm Wind THUNDERSTORM WINDS 1
## 5 Heavy Rain HEAVY RAIN 1
## 6 Lightning LIGHTNING 1
CPIs contains the yearly Consumer Price Indexes:
head(CPIs)
## CPI year
## 1 22.33167 1947
## 2 24.04500 1948
## 3 23.80917 1949
## 4 24.06250 1950
## 5 25.97333 1951
## 6 26.56667 1952
Let's join them together:
storm_data %<>%
## Attach CPIs to storm data set by 'year' variable.
left_join(y = CPIs) %>%
## Combine sorted event types with storm data set to determine how many
## official events involved in a given 'EVTYPE'
left_join(y = sorted_events, by = c("EVTYPE" = "unique_event")) %>%
## Retain only official event type names and remove 'EVTYPE' variable
select(year, official_event, F, FATALITIES:events_involved)
head(storm_data)
## year official_event F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 1950 Tornado 3 0 15 25.0 K 0
## 2 1950 Tornado 2 0 0 2.5 K 0
## 3 1951 Tornado 2 0 2 25.0 K 0
## 4 1951 Tornado 2 0 2 2.5 K 0
## 5 1951 Tornado 2 0 2 2.5 K 0
## 6 1951 Tornado 2 0 6 2.5 K 0
## CROPDMGEXP CPI events_involved
## 1 24.06250 1
## 2 24.06250 1
## 3 25.97333 1
## 4 25.97333 1
## 5 25.97333 1
## 6 25.97333 1
The PROPDMGEXP and CROPDMGEXP variables represent exponents for PROPDMG and CROPDMG respectively. E.g. the final value of a given PROPDMG variable will be
PROPDMG * 10^PROPDMGEXP
This is true only for cases where the PROPDMGEXP value can be coerced to numeric (i.e. is a digit). There are also additional …EXP values besides plain digits: "", "-", "+" and metric prefixes (hecto, kilo, mega and "B", treated as billion):
levels(storm_data$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(storm_data$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
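A compact sketch of how a single exponent code translates into a multiplier, mirroring the rules applied inline in the mutate() call further down (a helper written only for clarity, not used by the analysis):
## Map one exponent code to its multiplier: letters are metric prefixes,
## digits are powers of ten, '', '-' and '+' mean no scaling and '?'
## contributes nothing.
exp_multiplier <- function(code) {
  prefixes <- c(h = 1e2, H = 1e2, k = 1e3, K = 1e3, m = 1e6, M = 1e6, B = 1e9)
  code <- as.character(code)
  if (code %in% names(prefixes)) return(unname(prefixes[code]))
  if (grepl("^[0-9]$", code)) return(10^as.numeric(code))
  if (code %in% c("", "-", "+")) return(1)
  0
}
exp_multiplier("K")
## [1] 1000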
To convert previous years' values to their 2016 equivalents, we need to calculate the growth factor of each previous year relative to 2016:
## Calculate the growth factor of the CPI relative to 2016. The
## coefficient uses annual mean values, so 2016 is taken as the newest
## completed year.
growth_factor <-
sapply(
storm_data$CPI,
FUN = function(x) CPIs$CPI[CPIs$year == 2016] / x
)
head(growth_factor)
## [1] 9.974410 9.974410 9.240603 9.240603 9.240603 9.240603
The final step for this stage is to obtain the real …DMG values. Each PROPDMG/CROPDMG value is calculated by the following formula:
(…DMG * growth_factor * 10^…DMGEXP) / events_involved
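As a worked example (values taken from the outputs above): the first record is a 1950 tornado with PROPDMG = 25, PROPDMGEXP = "K", a growth factor of about 9.97441 and one official event involved:
## 25 * 10^3 dollars of 1950 property damage, converted to 2016 dollars
25 * 10^3 * 9.974410 / 1
## roughly 249,360 -- matching the first PROPDMG value shown below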
storm_data %<>%
mutate(
## Final property damage formula:
## Property_damage =
## (PROPDMG * growth_factor * propertyDamageExponent) /
## officialEventsInAGivenEvent
PROPDMG = PROPDMG * growth_factor * sapply(
PROPDMGEXP,
FUN = function(x)
{
if (x == "" | x == "-" | x == "+")
{
return(1)
}
if (x == "?")
{
return(0)
}
if (x == "h" | x == "H")
{
return(10^2)
}
if (x == "K")
{
return(10^3)
}
if (x == "m" | x == "M")
{
return(10^6)
}
if (x == "B")
{
return(10^9)
}
10^as.numeric(levels(storm_data$PROPDMGEXP)[x])
}
) / events_involved,
## Final crop damage formula:
## Crop_damage =
## (CROPDMG * growth_factor * cropDamageExponent) /
## officialEventsInAGivenEvent
CROPDMG = CROPDMG * growth_factor * sapply(
CROPDMGEXP,
FUN = function(x)
{
if (x == "")
{
return(1)
}
if (x == "?")
{
return(0)
}
if (x == "k" | x == "K")
{
return(10^3)
}
if (x == "m" | x == "M")
{
return(10^6)
}
if (x == "B")
{
return(10^9)
}
10^as.numeric(levels(storm_data$CROPDMGEXP)[x])
}
) / events_involved
) %>%
## Get rid of unnecessary variables
select(-PROPDMGEXP, -CROPDMGEXP, -CPI, -events_involved)
head(storm_data)
## year official_event F FATALITIES INJURIES PROPDMG CROPDMG
## 1 1950 Tornado 3 0 15 249360.26 0
## 2 1950 Tornado 2 0 0 24936.03 0
## 3 1951 Tornado 2 0 2 231015.06 0
## 4 1951 Tornado 2 0 2 23101.51 0
## 5 1951 Tornado 2 0 2 23101.51 0
## 6 1951 Tornado 2 0 6 23101.51 0
The final stage is to analyse the obtained data and visualize it. When reading the data set we also loaded the Fujita scale tornado categories, F. That's because the consequences of a tornado may differ greatly depending on its severity: an F0 tornado may cause relatively little damage, while an F5 is a huge catastrophe. If we treated all tornadoes without this classification we would get averaged-out injury and damage values, which would mislead the whole investigation. It follows that we need to separate the different tornado categories. There are 7 levels of the F variable:
levels(storm_data$F)
## [1] "" "0" "1" "2" "3" "4" "5"
Since we don't know which category the "" level represents, and this level is rather rare
((filter(.data = storm_data, official_event == "Tornado" & F == "") %>% nrow()) /
(filter(.data = storm_data, official_event == "Tornado") %>% nrow())) %>%
round(digits = 2)
## [1] 0.03
it's better to ignore these records.
## Tornado has 6 classification types of severity, from F0 to F5. Tornadoes
## health and economic impact needs to be split among F categories
tornado_data <- filter(.data = storm_data, official_event == "Tornado" & F != "")
## Concatenate classification to each tornado event type
tornado_data$official_event <-
paste(tornado_data$official_event, " F", tornado_data$F, sep = "")
## Change all "Tornado" event types to "Tornado F[number]"
storm_data %<>%
filter(official_event != "Tornado") %>%
bind_rows(tornado_data)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
storm_data$official_event %<>% as.factor()
tail(storm_data)
## year official_event F FATALITIES INJURIES PROPDMG CROPDMG
## 660283 2011 Tornado F2 2 0 0 266768.2 0
## 660284 2011 Tornado F1 1 0 0 533536.5 0
## 660285 2011 Tornado F1 1 0 0 106707.3 0
## 660286 2011 Tornado F0 0 0 0 0.0 0
## 660287 2011 Tornado F0 0 0 0 0.0 0
## 660288 2011 Tornado F1 1 0 2 4268291.8 0
Now it's time to create the health harm and economic damage plots.
## Used to create health harm plot
health_harm_summary <-
group_by(.data = storm_data, official_event) %>%
summarise(
## Health harm average: fatalities and injuries
health_harm = mean(FATALITIES + INJURIES),
## Fatality fraction in health harm average
fatality_rate = (sum(FATALITIES) / sum(FATALITIES + INJURIES))
)
There are some NaNs produced. Let's see which variables have them:
sapply(health_harm_summary, FUN = function(variable) anyNA(variable))
## official_event health_harm fatality_rate
## FALSE FALSE TRUE
There are NaNs in the fatality_rate variable. It's better to replace them with 0s:
health_harm_summary$fatality_rate %<>%
sapply(
FUN = function(x)
{
if (is.nan(x))
{
return(0)
}
x
}
)
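An equivalent, more compact alternative (a design note only; it produces the same result) would be plain logical indexing:
## Same effect via logical indexing
health_harm_summary$fatality_rate[is.nan(health_harm_summary$fatality_rate)] <- 0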
Create health harm plot:
library(ggplot2)
library(tidyr)
health_harm_plot <-
ggplot(data = health_harm_summary) +
geom_col(
mapping = aes(
x = reorder(official_event, health_harm),
y = health_harm,
fill = fatality_rate * 100
)
) +
scale_fill_continuous(
name = "Fatality rate, %",
low = "#56B1F7",
high = "#132B43"
) +
labs(
title = "Average Health Harm (Injuries + Fatalities) for Each Event Type",
x = "Event types",
y = "Average people injured"
) +
coord_flip()
## Used to create economic damage plot
economic_damage_summary <-
group_by(.data = storm_data, official_event) %>%
summarise(
property_damage = mean(PROPDMG),
crop_damage = mean(CROPDMG),
total_damage = property_damage + crop_damage
) %>%
## Rearrange data frame to represent long data instead of wide
gather(key = damage_type, value = value, property_damage:crop_damage) %>%
arrange(desc(total_damage)) %>%
## Damage values vary from 10 to over 400,000,000 US dollars. This additional
## variable, 'facet_area', is used to split different damage values among
## facets for better visualizing
mutate(
facet_area = factor(
c(rep(1L, 12), rep(2L, 38), rep(3L, 54)),
labels = c("over 10 mln.", "100 ths. - 10 mln.", "less than 100 ths."),
)
)
Create economic damage plot:
library(scales)
economic_damage_plot <-
ggplot(data = economic_damage_summary) +
geom_col(
mapping = aes(
x = reorder(official_event, value),
y = value,
fill = damage_type
)
) +
scale_y_continuous(labels = comma) +
scale_fill_discrete(
name = "",
labels = c("Crop damage", "Property damage")
) +
facet_grid(facets = ~facet_area, scales = "free_x") +
theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
labs(
title = paste(
"Average Economic Damage (Property + Crop)",
"for Each Event Type, 2016 Year's Buying Power Equivalent",
sep = "\n"
),
x = "Event types",
y = "Average damage, US dollars"
) +
coord_flip()
The most dangerous weather events for human health are tornadoes of the F5 and F4 categories, which cause 99 and 33 injuries on average respectively. Among the people harmed, 9% of cases are fatal for Tornado F5 and 6% for Tornado F4. Tsunamis also carry high health risks, with 8 injuries on average and 20% fatal cases among the people harmed.
Another point which should not be ignored is the high fatality rates. An average recorded Sleet weather event harms only 0.03 people, yet all (100%) of those cases are fatal. Other event types with high fatality rates that deserve attention are listed below (the sketch after the list shows how these rates can be read from health_harm_summary):
Cold/Wind Chill (89% fatal cases)
Storm Surge/Tide (69% fatal cases)
Avalanche (57% fatal cases)
Rip Current (52% fatal cases)
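A quick way to read these rates, using the health_harm_summary object created above:
## Event types with the highest fatality rates
arrange(health_harm_summary, desc(fatality_rate)) %>% head()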
The following bar plot shows the average number of people harmed by the different weather event types. Color tones show the fatality rate among those harmed: darker colors mean a higher share of fatal cases.
Hurricanes cause the biggest economic damage. An average hurricane causes $402,048,114 of economic damage ($376,306,559 of property damage and $25,741,556 of crop damage). Tornadoes of the F5 and F4 categories come next, with $133,536,352 and $47,903,101 of total damage respectively.
An additional point to take into account is the high crop damage caused by Droughts ($7,572,106) and Ice Storms ($4,037,989).
The following bar plot shows the average economic damage from the different weather event types. Since the damage varies from 10 to over 400,000,000 US dollars, the values are split into three groups with different x-axis scales for better readability.