Synopsis

We try to answer two questions that might be answered by the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

We will need the folowing R packages to do this analysis

library(tidyverse)
library(janitor)
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)
library(inspectdf)
library(fuzzyjoin)
library(ggplot2)
library(forcats)

The data is in this file. The following code loads it into the variable storm_data

#We expect the data is in the project_data subdirectory. If we cannot find it there, we will download it
if (!file.exists(file.path("project_data", "StormData.csv.bz2")))
{
  download.file(
    "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
    "project_data/StormData.csv.bz2"
  )
}
project_data <-
  read_csv(
    file.path("project_data", "StormData.csv.bz2"),
    trim_ws = TRUE,
    col_types =
      cols(
        STATE__ = col_double(),
        BGN_DATE = col_character(),
        BGN_TIME = col_character(),
        TIME_ZONE = col_character(),
        COUNTY = col_double(),
        COUNTYNAME = col_character(),
        STATE = col_character(),
        EVTYPE = col_character(),
        BGN_RANGE = col_double(),
        BGN_AZI = col_character(),
        BGN_LOCATI = col_character(),
        END_DATE = col_character(),
        END_TIME = col_character(),
        COUNTY_END = col_character(),
        COUNTYENDN = col_character(),
        END_RANGE = col_double(),
        END_AZI = col_character(),
        END_LOCATI = col_character(),
        LENGTH = col_double(),
        WIDTH = col_double(),
        F = col_double(),
        MAG = col_double(),
        FATALITIES = col_double(),
        INJURIES = col_double(),
        PROPDMG = col_double(),
        PROPDMGEXP = col_character(),
        CROPDMG = col_double(),
        CROPDMGEXP = col_character(),
        WFO = col_character(),
        STATEOFFIC = col_character(),
        ZONENAMES = col_character(),
        LATITUDE = col_double(),
        LONGITUDE = col_double(),
        LATITUDE_E = col_double(),
        LONGITUDE_ = col_double(),
        REMARKS = col_character(),
        REFNUM = col_double()
      )
    
  )

The data has been loaded into the project_data variable.

Data Processing

Data cleaning, focus on economic damage

The variables that seem to have the most economic meaning are: - PROPDMG Property damage - PROPDMGEXP Property damage multiplier - CROPDMG Crop damage - CROPDMGEXP Crop damage multiplier

As the documentation for the dataset states, these are estimates of damages caused by the event and they are captured in dollar value. Lets have a look at the shape of them:

project_data%>%select(contains("DMG"))%>%summary()

##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000

We are interested in events that actually caused damage, so it seems reasonable to do without anything having zero in both DMG variables. Also, if there should be any NA, we are not interested in them:

project_data %>% 
  #Recodes na in damage values as zero
  mutate(PROPDMG=ifelse(is.na(PROPDMG),0,PROPDMG),
         CROPDMG=ifelse(is.na(CROPDMG),0,CROPDMG),
         TOTDMG=PROPDMG+CROPDMG
         )%>% 
  #Requires either one to be non zero
  filter(TOTDMG>0) ->current_data
current_data%>%select(contains("DMG"))%>%summary()

##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:245031      Min.   :  0.000   Length:245031     
##  1st Qu.:   2.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   8.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  44.42                      Mean   :  5.623                     
##  3rd Qu.:  25.00                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##      TOTDMG       
##  Min.   :   0.01  
##  1st Qu.:   2.50  
##  Median :  10.00  
##  Mean   :  50.04  
##  3rd Qu.:  30.00  
##  Max.   :5000.00

Now we have this new dataset, current_data, that has a TOTDMG variable with the sum of property and crop damage. Lets see how those components of total damage are behaving by sampling 10 of them:

current_data%>%sample_n(10)%>%select(contains("DMG"))

The documentation for this dataset states:

Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.

So the alphabetical character reffered to in the documentation seems to be the content of the *EXP columns and, as stated, they represent a multiplier or index of magnitude of the damage in dollars.

So lets see if we have NAs in this rather important variables for multiplying estimates of damages:

current_data%>%select(contains("DMG"))%>%inspect_na()

The magnitude multipliers have a large number of NA. Now, these are only significant if they have a corresponding larger than 0 value in their DMG counterpart variable. Lets see the CROP columns with only rows that actually have damage in them, but have NA in their multiplier “EXP” column:

current_data %>% filter(is.na(CROPDMGEXP)&CROPDMG>0)%>%select(contains("DMG"))

There are very few rows with values that have NA as multiplier in CROPDMGEXP and some value in CROPDMG.

However, if we were just to drop the row due to those NA, we might loose some PROPDMG correctly encoded value. The preceeding table shows that the total damage lost in that case would be 5’505,000 USD in total.

Lets check the property side of the damages with the same strategy:

current_data %>% filter(is.na(PROPDMGEXP)&PROPDMG>0)%>%select(contains("DMG"))

76 rows have value in property damage but no multiplier. If we were to get rid of them, would we loose a lot in CROPDMG?

current_data %>% filter(is.na(PROPDMGEXP)&PROPDMG>0 &CROPDMG>0)%>%select(contains("DMG"))

No. We would loose about 74K worth of crop damages.

So since we dont have much to loose, lets get rid of any and all rows that have an NA in either PROPDMGEXP or CROPDMGEXP.

current_data %>% 
  filter(!is.na(PROPDMGEXP))%>% 
  filter(!is.na(CROPDMGEXP))->current_data
current_data %>% select(contains("DMG")) %>% inspect_na()

So now the current_data set is the subset of the project_data that: - If it had an NA in any DMG variable, it was replaced with 0 - does not have zero in one or the other DMG value variables - absolutely no NAs were allowed in the magnitude multipliers. Those rows were removed.

The economic damage multiplier columns

Lets see what multipliers actually are there in this variables:

current_data%>%filter(PROPDMG!=0)%>%pull(PROPDMGEXP)%>%table

## .
##     0     3     5     B     K     m     M 
##     4     1     2    24 90175     1  3924

So the most important PROPDMGEXP are clearly K,M,B. We might only use this ones. However, the documentation also states that:

If additional precision is available, it may be provided in the narrative part of the entry.

So lets have a look at this narrative data for those seemingly not as signifficant damage types:

current_data%>%filter(PROPDMG!=0)%>%pull(PROPDMGEXP)%>%table%>%names()->mults
mults[! mults %in% c("K","B","M")] -> mults
current_data%>%filter(PROPDMGEXP %in% mults & !is.na(REMARKS) & PROPDMG!=0) %>%
  select(PROPDMG,PROPDMGEXP,REMARKS)

As can be seen from the remarks, this records do encode property damage. However, it seems most of them are problematic estimates that we could safely ignore, or they were finally put in the remarks.

We will thus only accept records with K, M and B as multipliers

Lets do the same with the CROPDMGEXP variable:

current_data%>%filter(CROPDMG!=0)%>%pull(CROPDMGEXP)%>%table

## .
##     0     B     k     K     m     M 
##    11     3    21 16297     1  1477

current_data%>%filter(CROPDMG!=0)%>%pull(CROPDMGEXP)%>%table%>%names()->mults
mults[! mults %in% c("K","B","M")] -> mults
current_data%>%filter(CROPDMGEXP %in% mults & !is.na(REMARKS) & CROPDMG!=0) %>%
  select(CROPDMG,CROPDMGEXP,REMARKS)

Again Hurricane Opal and then some other damages. We will only accept records with B,K and M as multipliers.

We will now create plain numeric, full dollar value columns PROPERTY_DAMAGE and CROP_DAMAGE, multiplying accordingly. We will also summarize both economic damage columns into TOTAL_DAMAGE, which will now have the mutliplier taken into account:

current_data %>% mutate(
  PROP_MULT=recode(
    PROPDMGEXP,K=1000,M=1000000,B=1000000000,.default=0),
  CROP_MULT=recode(
    CROPDMGEXP,K=1000,M=1000000,B=1000000000,.default=0),
  PROPERTY_DAMAGE=PROPDMG*PROP_MULT,
  CROP_DAMAGE=CROPDMG*CROP_MULT,
  TOTAL_DAMAGE=PROPERTY_DAMAGE+CROP_DAMAGE
  )->current_data
current_data %>%  select(contains("CROP"),contains("DMG"),contains("DAMAGE")) %>%summary()

##     CROPDMG       CROPDMGEXP          CROP_MULT        
##  Min.   :  0.0   Length:95707       Min.   :0.000e+00  
##  1st Qu.:  0.0   Class :character   1st Qu.:1.000e+03  
##  Median :  0.0   Mode  :character   Median :1.000e+03  
##  Mean   : 11.9                      Mean   :4.843e+04  
##  3rd Qu.:  0.0                      3rd Qu.:1.000e+03  
##  Max.   :978.0                      Max.   :1.000e+09  
##   CROP_DAMAGE           PROPDMG         PROPDMGEXP       
##  Min.   :0.000e+00   Min.   :   0.00   Length:95707      
##  1st Qu.:0.000e+00   1st Qu.:   2.00   Class :character  
##  Median :0.000e+00   Median :   8.00   Mode  :character  
##  Mean   :3.209e+05   Mean   :  42.17                     
##  3rd Qu.:0.000e+00   3rd Qu.:  25.00                     
##  Max.   :5.000e+09   Max.   :5000.00                     
##      TOTDMG        PROPERTY_DAMAGE      TOTAL_DAMAGE      
##  Min.   :   0.01   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:   3.00   1st Qu.:3.000e+03   1st Qu.:3.000e+03  
##  Median :  10.00   Median :1.000e+04   Median :1.000e+04  
##  Mean   :  54.08   Mean   :2.411e+06   Mean   :2.732e+06  
##  3rd Qu.:  36.00   3rd Qu.:3.500e+04   3rd Qu.:5.000e+04  
##  Max.   :5000.00   Max.   :1.150e+11   Max.   :1.150e+11

We can now see a more interesting picture. We only have rows with valid and significant damage data in dollars. By the rows removed, we are seeing an order of magnitude of difference, as project_data has 902297 observations while current_data only has 95707.

However, as we are tasked with both health and economic data, we still need to have all this values in the main table, plus whatever health data we might need. To do that, an unique per row id table is ideal, as this would allow us to join the economic data we have processed back into the main table

project_data%>%select(REFNUM) %>%distinct() %>% nrow()

## [1] 902297

project_data%>%nrow()

## [1] 902297

As we can see, the full table has an id column called REFNUM. It is the same as the one in current_data. So now we can join by that column:

 current_data %>%  select(contains("CROP"),contains("DMG"),contains("DAMAGE"),contains("REFNUM")) ->econ_data

 project_data<-left_join(project_data,econ_data,by=c("REFNUM"="REFNUM")) 
rm(current_data)
rm(econ_data)
 gc(full = TRUE)

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  1487995  79.5    2759305  147.4   1946081  104.0
## Vcells 69192050 527.9  212395949 1620.5 212395105 1620.5

This leaves project_data with all its original rows and the value new columns. Where the refnum matched, it will have the corresponding DAMAGE values. Where it did not, the value for those columns will be NA.

Cleaning of the Event Type column

Both questions we are attempting to answer require us to categorize events by the EVTYPE variable content. Lets have a look at it:

project_data %>% distinct(EVTYPE) %>% arrange(EVTYPE)

It seems that the event types repeat themselves in meaning. For example:

Sometimes by letter case and plurals: - Frost/Freeze
- FROST/FREEZE - HAIL/WIND
- HAIL/WINDS

And that they encode subcategories at some points, separated by a “/” character: - DUST STORM
- DUST STORM/HIGH WINDS

Some are invalid characters or present troublesome features.

The variation seems very significant, as there are close to 1000 distinct event types that seem to have a lot of capture errors.

Lets see if some of them are more or less important for our purposes:

library(dplyr)
project_data%>%
  pull(EVTYPE) %>% 
  table %>% 
  sort(decreasing = TRUE) %>% head(30)

## .
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219944                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54278                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15755 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3797                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE                 BLIZZARD 
##                     3392                     2761                     2719 
##                  DROUGHT                ICE STORM           EXCESSIVE HEAT 
##                     2488                     2006                     1678 
##               HIGH WINDS         WILD/FOREST FIRE             FROST/FREEZE 
##                     1533                     1457                     1342 
##                DENSE FOG       WINTER WEATHER/MIX           TSTM WIND/HAIL 
##                     1293                     1104                     1028

So some events are more frequent than others, but as we are interested in economic value and health data, we cannot just reduce this by eliminating the less frequent ones.

Fortunately, there is an authoritative table of event types in the documentation for the NOAA dataset downloaded at: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. In section 2, it is stated:

Permitted Storm Data Events.The only events permitted in Storm Data are listed in Table 1 of Section 2.1.1. The chosen event name should be the one that most accurately describes the meteorological event leading to fatalities, injuries, damage, etc. However, significant events, such as tornadoes, having no impact or causing no damage, should also be included in Storm Data. See Section 7 for detailed examples.

Said table in the PDF encodes very similarly named, but not exactly, events as we have seen in the data. This is a screenshot of it.

Event Table Note: i did not regard this screenshot as a “figure”. It is evidence of due diligence and documentation

I encoded it in a csv file called event_table.csv, which I will load now

authorized_types<-read_csv("./project_data/event_types.csv",col_types=cols(
  event_name = col_character()
),trim_ws=TRUE)
head(authorized_types)

Idealy, we would want to keep any and all rows from our economically significant dataset in health_data. We will try several strategies for this. The first thing is to know the full economic damage. This is straightforward:

sum(project_data$TOTAL_DAMAGE,na.rm=TRUE)%>%prettyNum(big.mark = ",")

## [1] "261,470,112,580"

Whatever we do with the EVTYPEs, we must account for that much of economic value.

A similar meassure of health data impact by category can be achieved by adding injuries and fatalities:

project_data%>%mutate(TOTAL_HEALTH_PROBLEMS=INJURIES+FATALITIES) ->project_data
sum(project_data$TOTAL_HEALTH_PROBLEMS)%>%prettyNum(big.mark = ",")

## [1] "155,673"

So this two figures are the total economic value and the total health impact respectively. Whatever we do to the categorize EVTYPES better, the total categorized values should approach this and still preserve some insight of the categories used.

Authorized event names vs recorded event types

Lets see how many authorized names are there and compare it to the unique names on the table:

project_data%>%distinct(EVTYPE)->orig_event_types
nrow(orig_event_types)

## [1] 977

nrow(authorized_types)

## [1] 48

How much of the ones in the original actually match the authorized exactly? The first thing will be to have a column in both that is of the same case, to compare them directly and maybe join them:

orig_event_types%>%mutate(upper=toupper(EVTYPE))->orig_event_types_m
authorized_types%>%mutate(upper=toupper(event_name))->authorized_types_m

Now…hopefully, some of these will match as much as possible from both tables. Lets see:

right_join(authorized_types_m,orig_event_types_m,suffix=c(".auth",".orig"),
           by=c("upper"="upper"))->events_joined
nrow(events_joined)

## [1] 977

head(events_joined)

Now we have this events_joined table.

Wherever there is a NA in the event_name, this means the event type name only appears in the original EVTYPE field of the NOAA data, but it did not match exactly the authorized name from the documentation.

Lets see how much injury or economic data we might loose if we only kept the reccords with an exact authorized name on them:

events_joined %>% filter(!is.na(event_name))%>%pull(EVTYPE) ->events_allowed
project_data%>%summarise(ALL_ECONDAMAGE=sum(TOTAL_DAMAGE,na.rm=T),
            ALL_HEALTHDAMAGE=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T))->total_damages_sum
total_damages_sum %>%prettyNum(",")

##    ALL_ECONDAMAGE  ALL_HEALTHDAMAGE 
## "261,470,112,580"         "155,673"

project_data%>%
  filter(EVTYPE %in% events_allowed)%>%
  #Add up the economic and health values of the subset NOT matched directly
  #by the authorized event names table.
  summarise(SUBSET_ECONDAMAGE_pct=sum(TOTAL_DAMAGE,na.rm=T),
            SUBSET_HEALTHDAMAGE=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T))%>%
  prettyNum(",")

## SUBSET_ECONDAMAGE_pct   SUBSET_HEALTHDAMAGE 
##     "202,242,822,970"             "139,753"

Thats too much to loose. We need to refine all of this data to group correctly.

So we will need better strategies to match this names. Lets look at events_joined, (which tells us which events match, dont match and come from each table) more closely.

events_joined%>% mutate(event_name=toupper(event_name))%>% arrange(upper)

So we can see where most of the errors are coming from. This is a list of some missmatches. Except where otherwise indicated, the left hand side of each missmatch is the authorized version and the right is the string that should have matched but didnt because its not an exact match.

These are examples of non-matches that seem to be easy to match by less strict matching:

ASTRONOMICAL LOW TIDE vs. ASTRONOMICAL HIGH TIDE
COASTAL FLOOD vs COASTAL FLOODING
EXTREME COLD/WIND CHILL vs EXTREME COLD
FLASH FLOOD vs FLASH FLOOD/FLOOD, FLASH FLOODING, FLASH FLOODING/FLOOD
FLOOD vs FLOOD/FLASH FLOOD,FLOODING,FLOODS,RIVER FLOOD,RIVER FLOODING
FROST/FREEZE vs FREEZE, FREEZING FOG
HAIL vs HAIL 100, HAIL/WIND, HAIL/WINDS, SMALL HAIL
HEAT vs HEAT WAVE,HEAT WAVE DROUGHT
HEAVY RAIN vs HEAVY RAIN/HIGH SURF, HEAVY RAINS,HEAVY RAINS/FLOODING
HEAVY SNOW vs HEAVY SNOW/HIGH WINDS & FLOOD
HIGH WIND vs HIGH WIND AND SEAS,HIGH WINDS,HIGH WINDS HEAVY RAINS,HIGH WINDS/COLD
HURRICANE (TYPHOON) vs. HURRICANE ERIN, HURRICANE FELIX, HURRICANE OPAL,HURRICANE OPAL/HIGH WINDS, HURRICANE/TYPHOON

Discussion on further strategies for cleaning of event types.

From the above, we can say that the authorized event names table still has some unfortunate, for us, characteristics. For example, HURRICANE (TYPHOON), which is an expression of synonymity, prevents matching for anything containing the word hurricane if we are testing by equality of strings. Also, EXTREME COLD/WIND CHILL presents the same kind of problem.

As a solution I will start on the authorized_types_m dataframe. Its “upper” column contains the authorized name in uppercase. Using that I will create a “matcher” column which will contain a regular expression to match for either word in synonyms. Thus HURRICANE (TYPHOON) will become (HURRICANE)|(TYPHOON), which will match anything in the EVTYPE column of the dataset, with the words HURRICANE or TYPHOON in any order or context.

We will do the same with authorized names that use the “/” character.

Recall that we have two dataframes for this work. The orig_events_type_m has all the unique names of the EVTYPE columns and an upper column for them.

orig_event_types_m%>%str

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 977 obs. of  2 variables:
##  $ EVTYPE: chr  "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...
##  $ upper : chr  "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...

From the authorized names table, we constructed authorized_types_m. It has the authorized name in the event_name column and an upper column, which has the same thing but in uppercase.

authorized_types_m %>% str()

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 48 obs. of  2 variables:
##  $ event_name: chr  "Astronomical Low Tide" "Avalanche" "Blizzard" "Coastal Flood" ...
##  $ upper     : chr  "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" "COASTAL FLOOD" ...

Fuzzy regex join strategy

It is this last one to which we add the matcher column with a regular expression that tries to capture all or most of the related names in the upper column of the orig_event_types_m:

authorized_types_m %>% mutate(matcher=
                                str_replace_all(
                                trimws(upper), 
                                
                                c(
                                  #replaces parenthezied names with an "or" regex
                                  "([^\\( ]*) \\(([^\\)]*)\\)"="(\\1|\\2)",
                                  #replaces "/" names with an "or" regex
                                  "(.*)/(.*)"="(\\1|\\2)")
                                )) ->authorized_types_m
authorized_types_m

Lets see if we can match better with this matcher column and a regular expression matcher for a join operation.

This can be accomplished by the fuzzyjoin::regex_*_join family of functions.

First we will obtain all the events that actually did match with an equal join, so we may remove them from the regex join:

inner_join(orig_event_types_m,authorized_types_m,c(upper="upper"))%>%
  mutate(match_type="equal")-> straight_equal_joined 
#this variable has all events that have identical names in both tables
#We will now create another table that will join by regex, but only
#with the events that were not matched by the inner_join
already_matched <- straight_equal_joined$upper

Now we filter those already matched and do our regex join:

orig_event_types_m%>%
  filter(!upper %in% already_matched)%>%
  regex_inner_join(.,authorized_types_m,
                 by=c(
                   upper="matcher")
                 ) %>%
  mutate(match_type="regex")-> regex_joined

Now we have two tables: straight_equal_joined and regex_joined. Each has a different subset of the full set of unique EVTYPE names, matched to the authorized events. We want that set to be one table but, to do that, they have to have the same columns so we can row_bind them.

#We rename the columns to have the column names
straight_equal_joined %>% select(
                            #The original EVTYPE column name
                            EVTYPE_NAME=EVTYPE,
                            #The authorized name in UPPERCASE
                            UPPER_AUTHORIZED=upper,
                            #The authorized name as in the docs
                            AUTHORIZED=event_name,
                            MATCH_TYPE=match_type
                               ) -> straight_equal_joined

regex_joined %>% select(
                          EVTYPE_NAME=EVTYPE,
                          UPPER_AUTHORIZED=upper.y,
                          AUTHORIZED=event_name,
                          MATCH_TYPE=match_type
                          
)->regex_joined
regex_joined

So far so good. However, from the output, you can see the regex_joined table has the problem of duplicating rows whenever am EVTYPE is of a combined type. That is, when one EVTYPE name matches two authorized names.

This is a treatable issue. We will treat that type of event as a new category for grouping value. We will create a new event category which we will call “Compound”, whenever the event name in EVTYPE comes under the classification of more than one authorized type name. We will group damage for those events separately in the analysis.

Compund event types

Lets extract the duplicated names and see how we might reclassify them:

regex_joined[duplicated(regex_joined$EVTYPE_NAME),]$EVTYPE_NAME ->duplicated_names
regex_joined %>% mutate(is_duplicated=(EVTYPE_NAME %in% duplicated_names))->regex_joined
regex_joined %>% filter(is_duplicated)  -> only_duplicated_rows
only_duplicated_rows

Now, by looking at this, we still see names that are not really an expert classifying this as two types for the same event, but are capture issues:

EVTYPE_NAME	UPPER_AUTHORIZED
FLASH FLOODING	FLASH FLOOD
FLASH FLOODING	FLOOD

This should be classified as a plain FLASH FLOOD but because of the ING ending, it would be missclassified by us as a dual event compound. We will regard this as a special case.

Anotherone of the same problematic nature:

EVTYPE_NAME	UPPER_AUTHORIZED
EXTREME COLD	COLD/WIND CHILL
EXTREME COLD	EXTREME COLD/WIND CHILL

Should only match EXTREME COLD. Also, a problem with our regular expression approach.

These are the only two special cases in the name_equivalence table, with type regex, that yield a repeat authorized type classification, but would be missclassified as a “compound” event type as we propose.

We will “fix” this by creating a “special_cases” table, that will have the same columns but match_type=“by hand”, where we will extract only this strange and unique cases.

#Deal with the FLASH FLOODING special case
only_duplicated_rows%>% 
  #Select both rows
  filter(EVTYPE_NAME == "FLASH FLOODING")%>%
  #Change match type, we are doing it by hand
  mutate(MATCH_TYPE="by hand")%>%
  #Only keep the row that makes sense
  filter(UPPER_AUTHORIZED=="FLASH FLOOD")->special_cases_1

#Remove from the original
only_duplicated_rows %>% filter(EVTYPE_NAME != "FLASH FLOODING") ->only_duplicated_rows

#Deal with the EXTREME COLD special case
only_duplicated_rows %>% 
  filter(EVTYPE_NAME=="EXTREME COLD") %>% 
  mutate(MATCH_TYPE="by hand")%>%
  filter(UPPER_AUTHORIZED=="EXTREME COLD/WIND CHILL")->special_cases_2
only_duplicated_rows %>% filter(EVTYPE_NAME != "EXTREME COLD")->only_duplicated_rows

bind_rows(special_cases_1,special_cases_2) -> special_cases 
special_cases

special_cases now has the only cases we know do not meet our assumptions about event types. We have removed those cases from only_duplicated_rows, for which we still need a solution.

As with the straight_joined, regex_joined and special cases, we will create a compound_types dataset with the subset of only_duplicated_rows that dont have special_cases. We will keep the same columns as the regex_joined, straight_joined tables, because in the end we need exactly one table with all the correspondence between authorized names and unique EVTYPEs. The values of the columns in this new table are as follows:

EVTYPE_NAME: Will have the name of the event in EVTYPE
UPPER_AUTHORIZED: Will have the same value as AUTHORIZED, in uppercase
match_type: Will have “by hand: compound”
AUTHORIZED: Will have the string “Compound: [comma separated list of Authorized types the EVTYPE matches to]”

The following code accomplishes the described table:

only_duplicated_rows %>%
  select(EVTYPE_NAME,AUTHORIZED) %>%
  arrange(EVTYPE_NAME)%>%
  #Create a "key" column to uniquely identify EVTYPE_NAMExAUTHORIZED combinations
  mutate(key=paste0(EVTYPE_NAME,"_",AUTHORIZED))%>%
  #Spread will create columns for each combination in key
  #The value in each column will be the corresponding AUTHORIZED name for each
  #EVTYPE_NAME. It will place NA where that EVTYPE_NAME did not match the AUTHORIZED
  #and the name in AUTHORIZED where it did.
  spread(key,AUTHORIZED) %>%
  #Each row in this stage has a set of NAs or AUTHORIZED names that the EVTYPE matches to
  #Unite will paste all columns, separated by comma, including NA as "NA"
  unite("AUTHORIZED",-EVTYPE_NAME,sep = ",",remove=FALSE) %>%
  #All the NA will be a string "NA". We need to remove them
  mutate(AUTHORIZED=str_remove_all(AUTHORIZED,"NA,|NA|,NA"))%>%
  #As we stated, the AUTHORIZED column will encode the value as Compound: 
  #to help us identifythis sort of compound categorization further downstream
  mutate(AUTHORIZED=paste0("Compound: ",AUTHORIZED))%>%
  #We want the AUTHORIZED for each repeated EVTYPE_NAME. 
  #any other remaining column is ignored
  select(EVTYPE_NAME,AUTHORIZED) -> compound_evtypes

#We can now directly match all EVTYPE names that were matched by this strategy
#and extract them from the only_duplicated_rows table.
inner_join(compound_evtypes,only_duplicated_rows,
           by=c("EVTYPE_NAME"="EVTYPE_NAME")) %>% 
   mutate(UPPER_AUTHORIZED=toupper(AUTHORIZED.x),
           MATCH_TYPE="by hand: compound")%>%
  #since the join renames some columns, we rename:
   select(EVTYPE_NAME,
          AUTHORIZED=AUTHORIZED.x,
          UPPER_AUTHORIZED,
          MATCH_TYPE) %>%
  # #The above will still have repeated rows, although all repeated 
  # # will be identical. Thus, we use distinct()
  distinct()->compound_types

Now we can remove from regex_joined any records that were NOT duplicated:

#lets get only the real regex matches
regex_joined %>% filter(!is_duplicated) %>%select(-is_duplicated)-> regex_joined_only_nondup

Now we have a number of subsets of the unique EVTYPE names that have matched to the authorized names by several strategies. For straight equality, this is the subset:

straight_equal_joined

We also have the special_cases, which has a few outlier cases:

special_cases

We also have the regex matches that did not cause duplicates:

regex_joined_only_nondup

And finally, the compound types, which were regex matches that represent two or more authorized names for each EVTYPE:

compound_types

Together, this tables, should add up to the total of distinct names in the EVTYPES variable of the original database, as extracted in health_data:

bind_rows(straight_equal_joined,
regex_joined_only_nondup,
special_cases,
compound_types) -> evtype_authorized_translation
nrow(evtype_authorized_translation)

## [1] 515

nrow(distinct(project_data,EVTYPE))

## [1] 977

Grouping value by event types

As you can see, we are still missing quite a lot of names. Almost half of them. Lets look at how much dollar value or even health event information do this yet unclassified names can group. We need to know if we can ignore unclassified event types.

Lets calculate totals and percentage of damage value components:

project_data%>% ungroup()%>%summarise(
                total_damage=sum(TOTAL_DAMAGE,na.rm=T),
                crop_total=sum(CROP_DAMAGE,na.rm=T),
                property_total=sum(PROPERTY_DAMAGE,na.rm=T),
                total_health_prob=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
                total_fatal=sum(FATALITIES,na.rm=T),
                total_injuries=sum(INJURIES,na.rm=T)
)->total_damages_sum

#Print out the totals:

total_damages_sum%>%prettyNum(",")

##      total_damage        crop_total    property_total total_health_prob 
## "261,470,112,580"  "30,716,558,210" "230,753,554,370"         "155,673" 
##       total_fatal    total_injuries 
##          "15,145"         "140,528"

And now lets calculate percentages vs total of each interesting variable when we only take into account EVTYPEs that we could NOT determine the kind of AUTHORIZED event name they belong to:

project_data%>%filter(! EVTYPE %in% evtype_authorized_translation$EVTYPE_NAME)%>% summarize(
                total_damage_uncat=sum(TOTAL_DAMAGE,na.rm=T),
                total_econ_percent=total_damage_uncat/total_damages_sum$total_damage*100,
                crop_total_uncat=sum(CROP_DAMAGE,na.rm=T),
                crop_percent=crop_total_uncat/total_damages_sum$crop_total*100,
                property_total_uncat=sum(PROPERTY_DAMAGE,na.rm=T),
                property_total_uncat_pct=property_total_uncat/total_damages_sum$property_total*100,
                total_health_prob_uncat=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
                total_health_prob_uncat_pct=total_health_prob_uncat/total_damages_sum$total_health_prob*100,
                total_fatal_uncat=sum(FATALITIES,na.rm=T),
                total_fatal_uncat_pct=total_fatal_uncat/total_damages_sum$total_fatal*100,
                total_injuries_uncat=sum(INJURIES,na.rm=T),
                total_injuries_pct=total_injuries_uncat/total_damages_sum$total_injuries*100
          
          ) ->subtotals
subtotals%>%
  prettyNum(big.mark = ",")

##          total_damage_uncat          total_econ_percent 
##             "1,509,605,910"                 "0.5773531" 
##            crop_total_uncat                crop_percent 
##               "636,421,100"                  "2.071915" 
##        property_total_uncat    property_total_uncat_pct 
##               "873,184,810"                 "0.3784058" 
##     total_health_prob_uncat total_health_prob_uncat_pct 
##                    "10,360"                  "6.654975" 
##           total_fatal_uncat       total_fatal_uncat_pct 
##                       "881"                  "5.817101" 
##        total_injuries_uncat          total_injuries_pct 
##                     "9,479"                  "6.745275"

On the economics side, this is no problem. Not AUTHORIZED event types account for about 0.6% of the total economic value, 2.07% of the crop damage, 0.3% of the property damage.

On the health side, we are still on about 7% loss. Its still a bit too much, so lets try and pull out the top EVTYPES that we have not been able to classify automatically.

Cummulative damage ranking to select event types

project_data%>%filter(! EVTYPE %in% evtype_authorized_translation$EVTYPE_NAME)%>%
  group_by(EVTYPE)%>%
  summarize(
                total_damage_uncat=sum(TOTAL_DAMAGE,na.rm=T),
                total_econ_percent=total_damage_uncat/total_damages_sum$total_damage*100,
                crop_total_uncat=sum(CROP_DAMAGE,na.rm=T),
                crop_percent=crop_total_uncat/total_damages_sum$crop_total*100,
                property_total_uncat=sum(PROPERTY_DAMAGE,na.rm=T),
                property_total_uncat_pct=property_total_uncat/total_damages_sum$property_total*100,
                total_health_prob_uncat=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
                total_health_prob_uncat_pct=total_health_prob_uncat/total_damages_sum$total_health_prob*100,
                total_fatal_uncat=sum(FATALITIES,na.rm=T),
                total_fatal_uncat_pct=total_fatal_uncat/total_damages_sum$total_fatal*100,
                total_injuries_uncat=sum(INJURIES,na.rm=T),
                total_injuries_pct=total_injuries_uncat/total_damages_sum$total_injuries*100
          
          )%>%
    mutate(dmg_pct_total=total_damage_uncat/sum(total_damage_uncat)*100,
           cum_dmg=cumsum(dmg_pct_total),
           hlth_pct_total=total_health_prob_uncat/sum(total_health_prob_uncat)*100,
           cum_hlth=cumsum(hlth_pct_total))%>%
  select(EVTYPE,dmg_pct_total,cum_dmg,hlth_pct_total,cum_hlth)%>%
  arrange(desc(hlth_pct_total,dmg_pct_total))->uncat_events_values
  uncat_events_values

This is the cummulative sum of the percentage of damage captured by event type, ordered by the sum of health percentages. As it is immediatly apparent, the top 10 names ordered by percentage of health data captured will capture 100% of the value for both health and economic data.

Cleanup of names and correction of selections

How will we recategorize these names? There are obvious missmatches that we can treat by hand:

EVTYPE	AUTHORIZED
TSTM	Thunderstorm Wind
FOG	Dense Fog
WILD/FOREST FIRE	Wildfire
GLAZE	Extreme Cold/Wind Chill
WILD FIRES	Wildfire
ICE	Extreme Cold/Wind Chill
WIND	High Winds
URBAN/SML STREAM FLD	Flood
LANDSLIDE	as is
WINTRY MIX	as is

Its straightforward to do this in code. We just need to construct a table that can we can bind_rows to evtype_authorized_translation:

tibble(
  EVTYPE_NAME=c("TSTM WIND",
                "FOG",
                "WILD/FOREST FIRE",
                "GLAZE",
                "WILD FIRES",
                "ICE",
                "WIND",
                "URBAN/SML STREAM FLD",
                "LANDSLIDE",
                "WINTRY MIX"),
  AUTHORIZED=c("Thunderstorm Wind",
               "Dense Fog",
               "Wildfire",
               "Extreme Cold/Wind Chill",
               "Wildfire",
               "Extreme Cold/Wind Chill",
               "High Winds",
               "Flood",
               "Landslide",
               "Wintry Mix")
)%>%mutate(UPPER_AUTHORIZED=toupper(AUTHORIZED),
           MATCH_TYPE="By hand: unmatched")%>%
  #Make sure to preserve the order as need in the translation table
  select(EVTYPE_NAME,UPPER_AUTHORIZED,AUTHORIZED,MATCH_TYPE)->final_unmatched
#Bind this new translation elements to the authorized_translation table

bind_rows(evtype_authorized_translation,final_unmatched)->evtype_authorized_translation

Unauthorized event names

Now, other event types will remain untranslated but now we know those really are not significant. We will classify them as “Unauthorized event name”:

project_data%>%
  dplyr::left_join(evtype_authorized_translation,
             by=c("EVTYPE"="EVTYPE_NAME")
             )%>% filter(
               is.na(AUTHORIZED)
             ) %>% select(EVTYPE)%>%
            distinct() %>%
  #recall that the evtype_authorized_translation has the columns:
            mutate(EVTYPE_NAME=EVTYPE,
                   AUTHORIZED="Unauthorized event name",
                   UPPER_AUTHORIZED=toupper(AUTHORIZED),
                   MATCH_TYPE="by hand, unauthorized") -> unauthorized_names

bind_rows(evtype_authorized_translation,unauthorized_names) ->evtype_authorized_translation
nrow(evtype_authorized_translation)

## [1] 977

project_data%>%select(EVTYPE) %>% distinct() %>% nrow()

## [1] 977

Now our translation table evtype_authorized_translation is the same length as the number of unique EVTYPEs in the database.

We now classify all rows by joining with our translation table and verify that the row numbers are the same.

project_data %>% nrow()

## [1] 902297

project_data %>% inner_join(evtype_authorized_translation,
                            by=c("EVTYPE"="EVTYPE_NAME")) -> project_data_translated
project_data_translated %>% nrow()

## [1] 902297

Very well then, we now have what we need to see economic and health damage by event category. To save memory, we will clean some of the variables.

rm(health_data)

## Warning in rm(health_data): object 'health_data' not found

rm(economic_data)

## Warning in rm(economic_data): object 'economic_data' not found

project_data<-project_data_translated
rm(project_data_translated)
gc(full=TRUE)

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  1520690  81.3    2759305  147.4   2759305  147.4
## Vcells 74287146 566.8  212395949 1620.5 212395105 1620.5

So the unique event types we have now in the AUTHORIZED column actually group a significant ammount of both economic and health data.

Results

All the data is now cleaned, and we can freely group damages by any event type that groups significant values.

We will now summarize those values per AUTHORIZED, which is the column where the authorized event type is:

#remember that total_damages_sum has the sum totals of every variable we care about

project_data%>%
  filter(TOTAL_DAMAGE>0 | TOTAL_HEALTH_PROBLEMS>0)%>%
  mutate(AUTHORIZED=str_replace_all(AUTHORIZED,"Compound.*","Compound"))%>%
  group_by(AUTHORIZED)%>%
  summarize(
                econ_damage_total=sum(TOTAL_DAMAGE,na.rm=T),
                econ_damage_pct=(econ_damage_total/total_damages_sum$total_damage)*100,
                econ_crop_total=sum(CROP_DAMAGE,na.rm=T),
                econ_crop_pct=(econ_crop_total/econ_damage_total)*100,
                econ_property_total=sum(PROPERTY_DAMAGE,na.rm=T),
                econ_property_pct=(econ_property_total/econ_damage_total)*100,
                
                health_incidents_total=sum(TOTAL_HEALTH_PROBLEMS,na.rm=T),
                health_incidents_pct=(health_incidents_total/total_damages_sum$total_health_prob)*100,
                health_fatalities_total=sum(FATALITIES,na.rm=T),
                health_fatalities_pct=(health_fatalities_total/health_incidents_total)*100,
                health_injuries_total=sum(INJURIES,na.rm=T),
                health_injuries_pct=(health_injuries_total/health_incidents_total)*100
          
          )%>%
  mutate_if(is.character,as_factor)->total_damages_global_percentages

This new summarized table is rather important as it encodes all our variables grouped by type and calculates the percentage contribution to each total for each type in each variable.

With it, we can extract the most valuable groups for each of our variables. Remember, we care about two results: economic damage and health problems. And of those, we have several variables crop, property, injuries and fatalities, with two type of values, which are total and pct

total_damages_global_percentages%>%ungroup()%>%
  #Only percentages
  gather("type",value,contains("pct"),contains("total"))%>%
  separate(type,c("result","variable","type_val"),sep="_")%>%
  mutate_if(is.character,forcats::as_factor)%>%
  na.omit()%>%
   group_by(AUTHORIZED,result,type_val,variable)%>%arrange(value)->total_damages_gathered
# View(total_damages_gathered)

Now that we have all our economic data summarized by event type and classified by type of damage (total, crop and property), we can get a picture of how that is distributed for the economic variables.

Economic total, property and crop damages, by event type

econ_facet_names<-c("crop"="Crop damage per event",
               "property"="Property damage per event",
               "damage" =  "Total damage per event" )

total_damages_gathered%>%ungroup()%>%
  filter(result=="econ" &
    str_detect(type_val,"total")&
           str_detect(variable,"crop|property|damage"))%>%
  na.omit()->total_econ_to_plot



total_econ_to_plot%>%  
ggplot(aes(x=fct_reorder(AUTHORIZED,value,.desc=FALSE),
             y=value,
             fill=value,
             position="fill"
             )
         )+
  geom_col()+
  coord_flip()+
  facet_wrap(.~variable,
             scales = "free",
             labeller = as_labeller(econ_facet_names))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

A thing to notice here is the scale of dollar value. Crops are on the order of 1 with ten zeroes after at their worst, but property and damage scales go for an order of magnitude more. Having that in mind, we can conclude here that:

Floods are the most costly events
Hurricanes are the second worst
Tornados, hail and flash floods are also significant in both economic damage types
For crops:
- Ice storms, frost and freeze, drought and thunderstorms are all large contributors, while they dont appear as much in property damage.
For property:
- Tides are an interesting contributor that is not in crops damage.
For the total column, it is obvious that the top 15 events are the most costly of all events.

Injuries, fatalities and total health incidents by event type:

health_facet_names<-c("fatalities"="Fatalities per event",
               "injuries"="Injuries per event",
               "incidents" =  "Total health incidents per event" )

total_damages_gathered%>%ungroup()%>%
  filter(result=="health" &
    str_detect(type_val,"total")&
           str_detect(variable,"injuries|fatalities|incidents"))%>%
  na.omit()->total_health_to_plot

total_health_to_plot %>% 
  group_by(variable)%>%
  mutate(order_within=row_number())%>%
  arrange(order_within,value)%>%ungroup()->total_health_to_plot

total_health_to_plot%>%  
  filter(value!=0)%>%
ggplot(aes(x=fct_reorder(AUTHORIZED,value,.desc=FALSE),
             y=value,
             fill=value,
             position="fill"
             )
         )+
  geom_col()+
  coord_flip()+
  facet_wrap(.~variable,
             scales = "free",
             labeller = as_labeller(health_facet_names))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Conversely, we can see here fatalities and injuries are also in different scales for value. However, it is easy to see that even it differs how much of one or the other do we have per event, most of them are the same events.

Worst events in general The top 15 events in the total damages column is mostly the same, although with different ranking if you view them by injuries or fatalities.
Top killers: Avalanches, high surf, Cold/Wind chill, landslide, storm surge/tide are killers that dont seem to conrrespond in the injuries.
Injurers: The top 15 events shown are the top injurers.

Contribution by U.S. state to total health and economic damage.

As a final view of all damages across the U.S., we now present the total contribution of each state, by event type and type of damage, as a percentage of total damages summarized as economic and health damages, to see how they all rank by that meassure.

project_data %>% group_by(AUTHORIZED,STATE) %>% summarize(INJURIES=sum(INJURIES,na.rm=TRUE),
                                                          FATALITIES=sum(FATALITIES,na.rm=TRUE),
                                                          TOTAL_HEALTH_PROBLEMS=INJURIES+FATALITIES,
                                                          CROP_DAMAGE=sum(CROP_DAMAGE),
                                                          PROPERTY_DAMAGE=sum(PROPERTY_DAMAGE),
                                                          TOTAL_DAMAGE=CROP_DAMAGE+PROPERTY_DAMAGE,
                                                          TOTAL_ALL=TOTAL_HEALTH_PROBLEMS+TOTAL_DAMAGE) %>%
    filter(TOTAL_ALL>0)%>%ungroup() %>% mutate(PCT_INJ=percent_rank(INJURIES),P_FAT=percent_rank(FATALITIES),
                                               PCT_PROPERTY=percent_rank(PROPERTY_DAMAGE),
                                               PCT_CROP=percent_rank(CROP_DAMAGE),
                                               PCT_HTOT=percent_rank(TOTAL_HEALTH_PROBLEMS),
                                               PCT_TOTDMG=percent_rank(TOTAL_DAMAGE),
                                               STATE=as_factor(STATE),
                                               AUTHORIZED=as_factor(AUTHORIZED)) ->by_state_contribution
  
  by_state_contribution %>%
    select(AUTHORIZED,STATE,matches("PCT_.*"))%>% 
    gather("type_dmg","value",matches("PCT_.*")) %>%
    filter(value>0)%>%
    mutate(type_dmg=as_factor(type_dmg))-> percentages_by_state_and_event
  
  percentages_by_state_and_event %>% 
    spread(key = type_dmg,value = value)%>%mutate_at(vars(matches("PCT_*")), ~ifelse(is.na(.),0,.))->percentages_spread_and_cleaned
  percentages_spread_and_cleaned%>%
    mutate(AUTHORIZED=stringr::str_replace_all(AUTHORIZED,"Compound:","Cpd:"))%>%
    ggplot(aes(PCT_TOTDMG,
               PCT_HTOT,
               color=STATE))+geom_point()+facet_wrap(AUTHORIZED~.)+
    geom_label(label=percentages_spread_and_cleaned$STATE,nudge_x = 0.1,nudge_y = 0.15)+
    ggtitle("Health problems vs. Economic Damage",subtitle = "Contributing States Per Event Type")+
    xlab("Percentage of compounded (crop+property) economic damage")+
    ylab("Percentage of compounded (injuries+fatalities) health damage")+
    theme(legend.position="none")

So the more to the right a state label appears, the more percentage of economic damage it contributed to the total damage for that meteorological event type. The more to the top, the more health damage happened in the given state for that event type.

Conclussion

With this final graph and the other two supporting plots, a policy maker might rank event types as the most dangerous by injury or fatality, the most economically onerous by property or crop and, finally, where (which states), have suffered the most of the compound health (injuries+deaths) and economic (property+crops) damage.

NOAA Storm Database Analysis

Synopsis

Data Processing

Data Processing

Data cleaning, focus on economic damage

The economic damage multiplier columns

Cleaning of the Event Type column

Authorized event names vs recorded event types

Discussion on further strategies for cleaning of event types.

Fuzzy regex join strategy

Compund event types

Grouping value by event types

Cummulative damage ranking to select event types

Cleanup of names and correction of selections

Unauthorized event names

Results

Economic total, property and crop damages, by event type

Injuries, fatalities and total health incidents by event type:

Contribution by U.S. state to total health and economic damage.

Conclussion