Synopsis

In order to identify severe weather events in the United States to be prioritized for resource allocation and damage mitigation, a data analysis is conducted using the Storm Events Database produced by the U.S. National Oceanic and Atmospheric Administration (NOAA). The dataset is filtered, cleaned, and standardized into 43 unique weather event types, and the damages caused by each event type are summarized to draw a picture of the event’s total population health and economic damages across the United States. The top 10 events for both types of damages are identified, accounting for more than 90% of all damages caused by severe weather events. As a result, the government can prioritize efforts for damage prevention on these event types to reduce fatalities, injuries, and economic consequences of severe weather events.

Objectives

Given the U.S. NOAA Storm Events Database provided here, there are two key questions we seek to answer:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

First, we will import the necessary R packages for this analysis. Then, each data processing step will be marked with numbers, such as 1, 2, 3, and so on.

# For data processing
library(magrittr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(stringr, warn.conflicts = FALSE)

# For data visualization
library(ggplot2, warn.conflicts = FALSE)
library(scales, warn.conflicts = FALSE)
library(patchwork, warn.conflicts = FALSE)

# For prettifying tables
library(knitr, warn.conflicts = FALSE)

1. Read data

We begin this analysis by importing NOAA Storm Events Database, which contains:

A. Records of significant weather phenomena/events since January 1950.

B. Associated damages (fatalities/injuries/economic) attributed to these phenomena/events.

More information about the database can be found here, although the version we use here is older than what is currently hosted by NOAA. The version utilized here includes records up to November 2011.

# Download file, if the file is not found locally
if (!file.exists("StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./StormData.csv.bz2", method = "curl")
}

# Read data, and display a few info about it
data <- read.csv("StormData.csv.bz2")

cat("Finished importing StormData.csv.bz2!", sep="\n")
## Finished importing StormData.csv.bz2!
print(str_glue("Total rows: ", nrow(data)))
## Total rows: 902297
print(str_glue("Total columns: ", ncol(data)))
## Total columns: 37

2. Select and transform columns

We will reduce the number of columns in the dataset to retain only the information we will be using in this analysis and allow better focus during the course of analysis.

After carefully checking the available columns, only these columns will be retained:

A. EVTYPE as the types of events being recorded.

B. FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP as columns related to damages.

In addition, BGN_DATE, REMARKS, and REFNUM will be retained temporarily due to their usage in more data transformation later. To be exact, BGN_DATE is picked over END_DATE because all events have BGN_DATE, but not all are guaranteed to have END_DATE. These columns will be discarded after data processing associated with them is finished.

main_data <- data %>% select(BGN_DATE, EVTYPE, FATALITIES:CROPDMGEXP, REMARKS, REFNUM)

print(str_glue("Number of columns left: ", ncol(main_data)))
## Number of columns left: 10

Next, we will transform the columns to make further processing and analysis easier:

A. All event types (EVTYPE) will be converted to upper case with extra spaces removed for consistency.

B. BGN_DATE will be parsed using lubridate to convert the values into proper date-time objects.

C. A YEAR column will be added from BGN_DATE.

main_data <- main_data %>% 
  mutate(EVTYPE = EVTYPE %>% str_squish %>% str_to_upper, 
         BGN_DATE = mdy_hms(BGN_DATE), YEAR = year(BGN_DATE)) %>%
  relocate(YEAR, .before = BGN_DATE)

head(main_data) %>% kable
YEAR BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REMARKS REFNUM
1950 1950-04-18 TORNADO 0 15 25.0 K 0 1
1950 1950-04-18 TORNADO 0 0 2.5 K 0 2
1951 1951-02-20 TORNADO 0 2 25.0 K 0 3
1951 1951-06-08 TORNADO 0 2 2.5 K 0 4
1951 1951-11-15 TORNADO 0 2 2.5 K 0 5
1951 1951-11-15 TORNADO 0 6 2.5 K 0 6

3. Filter rows

In terms of rows, we will reduce the number of rows for the sake of better accuracy and fairness in comparison. Read on below to find out the reasons and keep in mind that more recent years are considered more complete in the instruction. Now, after checking the documentations on NOAA’s website1 2 and cross-validating with the dataset, it seems that:

A. Up to 1992, the database only recorded three types of events (Tornado, Thunderstorm Wind, and Hail), while other types of events are not recorded, which might explains why the database is still named Storm Events Database to this day. As a result, we cannot use data from before 1993 if we intend to keep a fair comparison between all types of events that we know and are being recorded today (officially, around ~48 event types in the documentation), because including data before 1993 means all other events will be underrepresented in the analysis, affecting the results.

num_of_events <- main_data %>% filter(YEAR <= 1992) %>%
  group_by(YEAR) %>% summarize(n = length(unique(EVTYPE)))

print(str_glue("Number of unique event types (<= 1992): ", num_of_events %>% pull(n) %>% max))
## Number of unique event types (<= 1992): 3

B. From 1993-1995, the records were extracted from unformatted text files, and a best effort attempt was done to extract information from the files. As noted in the documentation, there are many typos and inconsistencies in how the event types were recorded (potential sources of inaccuracies), and data from June-July 1993 is missing.

In addition, these are the years where there are unexpected characters within -EXP columns recording the “exponential”/magnitude of economic damages, where the real meanings of these characters are largely unknown (see Appendix A for potential interpretation, none of them certain), which can potentially introduce large distortions in the results if it turns out the true values are substantial.

exp_chars <- main_data %>% filter(between(YEAR, 1993, 1995)) %$% c(PROPDMGEXP, CROPDMGEXP)
exp_chars[!str_to_upper(exp_chars) %in% c("B", "K", "M", "")] %>% table
## .
##   -   ?   +   0   1   2   3   4   5   6   7   8   h   H 
##   1  15   5 234  25  14   4   4  28   4   5   1   1   6

Many of these problems are corrected by NOAA in 2013, but the corrections are not present in the version we use. As a result, for the sake of better accuracy (because both -EXP and EVTYPE seem to have noteworthy uncertainties during this time period), we will exclude these years and base our analysis on the records made after database software was used to store the data (>= 1996).

C. If we want the most accurate event classification, the 48 event types as we know them today exist after 2004 (there are still non-standard classification such as “PROLONG COLD” in 2003), which means data after 2004 is likely to be in the best shape to be analyzed within this dataset.

main_data %>% filter(YEAR >= 1996) %>% distinct(YEAR, EVTYPE) %>% pull(YEAR) %>% table
## .
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
##  199  135  122  118  110  120   99   51   38   46   50   46   46   46   46   46

However, we will only have 8 years of data if we only use data after 2004 (that is, 2004-2011). Because more data is better in this kind of analysis (at least, it would be preferable if we have at least a decade of data), we will not exclude more data due to event type reasons beyond the 3 years (1992-1995) we have excluded.

Finally, we can see that our latest data is November 2011. Because severe weather events likely have major seasonal influences in nature, it is important to match the months we use to avoid giving undue weights to certain stretch of the year in our calculation and comparison. As a result, we will use data between December 1996-November 2011 in order to represent each month exactly 15 times, ensuring that we do not give more weights or attention to particular months/seasons. Thus, our final data will be 15 years in length, which is more than the decade we desired, and we will filter the rows below.

main_data <- main_data %>% filter(BGN_DATE > "1996-12-01") %>% arrange(BGN_DATE)

# When displaying the data, we'll cut-off REMARKS to keep our previews short
head(main_data) %>% mutate(REMARKS = str_trunc(REMARKS, 30)) %>% kable
YEAR BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REMARKS REFNUM
1996 1996-12-01 LIGHTNING 0 0 50.0 K 0 Lightning struck a drug sto… 249360
1996 1996-12-01 TSTM WIND 0 0 1.5 K 0 Several trees were blown do… 249361
1996 1996-12-01 FLASH FLOOD 0 0 8.0 K 0 K Streets were reported flood… 249362
1996 1996-12-01 TSTM WIND 0 0 15.0 K 2 K A storage building lost a r… 249363
1996 1996-12-01 TSTM WIND 0 0 20.0 K 2 K Roof damage to houses along… 249364
1996 1996-12-01 TORNADO 0 0 80.0 K 2 K A small tornado touched dow… 249365

4. Calculate population health harm & economic damages

We will calculate the four types of damages recorded by NOAA’s Storm Events Database (fatalities, injuries, property damages, and crop damages), and then combine them into two types of damages according to the objectives of this report: population health and economic damages.

Here is a breakdown of the necessary transformation:

A. FATALITIES and INJURIES need no further calculation, but they will be added together into POPHEALTHDMG.

B. PROPDMG and CROPDMG need to be multiplied with the appropriate magnitude using PROPDMGEXP and CROPDMGEXP, respectively, and added together into ECONDMG. The -EXP columns are translated after cross-checking with the documentation and examining the data, then they will be discarded when no longer needed.

dmg_exp_conv <- tibble(DMGEXP = c("", "0", "K", "M", "B"),
                       DMGEXPCONV = c(1, 1, 10 ^ 3, 10 ^ 6, 10 ^ 9))

main_data <- main_data %>% 
  left_join(dmg_exp_conv, by = join_by(PROPDMGEXP == DMGEXP)) %>% 
    mutate(PROPDMG = PROPDMG * DMGEXPCONV) %>% 
    select(-DMGEXPCONV, -PROPDMGEXP) %>%
  left_join(dmg_exp_conv, by = join_by(CROPDMGEXP == DMGEXP)) %>% 
    mutate(CROPDMG = CROPDMG * DMGEXPCONV) %>% 
    select(-DMGEXPCONV, -CROPDMGEXP) %>%
  mutate(POPHEALTHDMG = FATALITIES + INJURIES, 
         ECONDMG = PROPDMG + CROPDMG)

head(main_data %>% select(EVTYPE:CROPDMG, POPHEALTHDMG:ECONDMG)) %>% kable
EVTYPE FATALITIES INJURIES PROPDMG CROPDMG POPHEALTHDMG ECONDMG
LIGHTNING 0 0 50000 0 0 50000
TSTM WIND 0 0 1500 0 0 1500
FLASH FLOOD 0 0 8000 0 0 8000
TSTM WIND 0 0 15000 2000 0 17000
TSTM WIND 0 0 20000 2000 0 22000
TORNADO 0 0 80000 2000 0 82000

5. Clean event types

At last, we will standardize the event types that will be used in the analysis. EVTYPE is the last column that needs specific transformation before analysis can be initiated.

num_uniq_types <- main_data %>% pull(EVTYPE) %>% unique %>% length

print(str_glue("Number of unique event types: ", num_uniq_types))
## Number of unique event types: 323

There are 323 event types in the data, and most of them overlap with each other, making it difficult to produce an accurate analysis without grouping/standardizing them first. If the event type grouping already follows the 48 event types used in the given NOAA documentation, overlap will not be a concern, but we have found out earlier that this is only the case for year >= 2004. Thus, event types will need to be merged together and the overlaps sorted out before we have the event types we can use in the analysis later.

An important note: Due to ambiguity or some non-standardized event types having larger scopes than the ideal 48 event types, the final number of event types might not be 48 or be the same event types as defined in the documentation. Details about groupings will be discussed in Appendix B.

Event types are combined together using regular expression patterns written after studying the available event types, the included remarks describing the events, and the common definitions of the weather terms. The code is provided below, but keep in mind that it can be quite… long, due to its degree of specificity.

print(str_glue("Number of rows (before standardization): ", nrow(main_data)))
## Number of rows (before standardization): 622448
# This variable holds all cleaned data except for three event types
# which require slightly different methods to clean
# and will be cleaned separately later for convenience
std_data_main <- main_data %>% 
  filter(!EVTYPE %in% c("MONTHLY RAINFALL", "MONTHLY PRECIPITATION", "OTHER")) %>%
  mutate(EVTYPE = case_when(
    str_detect(EVTYPE, "MARINE TSTM") ~ "MARINE THUNDERSTORM WIND",
    str_detect(EVTYPE, "(?<!MARINE )(?<!NON( |\\-))(TSTM|THUNDERSTORM|MICROBURST|(THUNDERSNOW)|(WALL CLOUD))") ~ "THUNDERSTORM WIND",
    str_detect(EVTYPE, "(FIRE|(RED FLAG))") ~ "WILDFIRE",
    str_detect(EVTYPE, "(?<!MARINE )HAIL") ~ "HAIL",
    str_detect(EVTYPE, "((CHILL|(((EXTREME|EXCESSIVE|UNSEASONABLY|UNUSUALLY|RECORD|PROLONG) )?COLD( (WEATHER|TEMPERATURES))?))|(LOW TEMP)|COOL|HYPOTHERMIA|(MONTHLY TEMPERATURE))(?! AND)") ~ "COLD/WIND CHILL",
    str_detect(EVTYPE, "(HEAT|WARM|HOT|HYPERTHERMIA|(RECORD HIGH)|(TEMPERATURE RECORD)|(RECORD TEMPERATURE))") ~ "EXCESSIVE/UNUSUAL HEAT",
    str_detect(EVTYPE, "(((HEAVY|EXCESSIVE|PROLONGED|UNSEASONAL|RECORD|EARLY) ((RAIN(FALL)?)|PRECIPITATION))|(RAIN .HEAVY.)|WET|DROWNING)") ~ "EXCESSIVE/UNUSUAL RAIN",
    str_detect(EVTYPE, "(DROUGHT|DRY|DRIEST|(RECORD LOW))") ~ "DRYNESS & DROUGHT",
    str_detect(EVTYPE, "(SURF|(HIGH SWELLS))") ~ "HIGH SURF",
    str_detect(EVTYPE, "SLEET") ~ "SLEET",
    str_detect(EVTYPE, "(?<!(CSTL|COASTAL|LAKESHORE|TIDAL) ?)(FLOOD|FLD|(HIGH WATER)|(DAM BREAK))") ~ "FLOOD/HIGH WATER",
    str_detect(EVTYPE, "((CSTL|COASTAL|TIDAL|BEACH) ?(FLOOD|FLD|EROSION)|(ASTRONOMICAL HIGH TIDE))") ~ "COASTAL FLOOD & EROSION",
    str_detect(EVTYPE, "VOLCANIC") ~ "VOLCANIC ASH",
    str_detect(EVTYPE, "(HURRICANE|TYPHOON|FLOYD)") ~ "HURRICANE/TYPHOON",
    str_detect(EVTYPE, "(ROCK|MUD|LAND) ?(SLIDE|SLUMP)") ~ "DEBRIS FLOW",
    str_detect(EVTYPE, "LAKE(-| )EFFECT SNOW") ~ "LAKE-EFFECT SNOW",
    str_detect(EVTYPE, "RIP CURRENT") ~ "RIP CURRENT",
    str_detect(EVTYPE, "( SEA|WAVE|(MARINE ACCIDENT))(?!SON)") ~ "ROUGH/HEAVY WAVE",
    str_detect(EVTYPE, "(TORNADO|LANDSPOUT)") ~ "TORNADO",
    str_detect(EVTYPE, "(FREEZE|FROST|(IC[EY].*ROAD)|(PATCHY ICE)|(BLACK ICE)|GLAZE)") ~ "FROST/FREEZE/GLAZE",
    str_detect(EVTYPE, "COASTAL( )?STORM") ~ "TROPICAL STORM",
    str_detect(EVTYPE, "STORM SURGE") ~ "STORM SURGE/TIDE",
    str_detect(EVTYPE, "SMOKE") ~ "SMOKE",
    str_detect(EVTYPE, "[FV]OG") ~ "FOG",
    str_detect(EVTYPE, "((FREEZING (RAIN|DRIZZLE|PRECIP))|(MIXED PRECIP)|((WINTER|WINTRY|WINTERY)( WEATHER)?( (MIX|WEATHER)))|(SLEET)|(RAIN/SNOW)|(ICE PELLETS))(?! STORM)") ~ "WINTER WEATHER",
    str_detect(EVTYPE, "(?<!ICE/)SNOW") ~ "SNOW",
    str_detect(EVTYPE, "BLIZZARD") ~ "BLIZZARD",
    str_detect(EVTYPE, "((ICE STORM)|(SLEET STORM)|(WINTER STORM)|(ICE/SNOW))") ~ "ICE/WINTER STORM",
    str_detect(EVTYPE, "^ICE$") ~ "ICE/WINTER STORM",
    str_detect(EVTYPE, "WATERSPOUT") ~ "WATERSPOUT",
    str_detect(EVTYPE, "(?<!MARINE |LAKE )(STRONG|HIGH|GUSTY|LOW|GRADIENT|(NON.*)) WIND") ~ "WIND GUST",
    str_detect(EVTYPE, "^(WIND|WND).*$") ~ "WIND GUST",
    str_detect(EVTYPE, "((MARINE (STRONG|HIGH))|(GUSTY LAKE))") ~ "MARINE WIND GUST",
    str_detect(EVTYPE, "DUST DEV[IE]L") ~ "DUST DEVIL",
    str_detect(EVTYPE, "((BLOWING DUST)|(SAHARAN DUST))") ~ "DUST STORM",
    str_detect(EVTYPE, "((FUNNEL CLOUD)|WHIRLWIND)") ~ "FUNNEL CLOUD",
    str_detect(EVTYPE, "BLOW-OUT TIDE") ~ "ASTRONOMICAL LOW TIDE",
    .default = EVTYPE
  )) %>%
  filter(!str_detect(EVTYPE, "(SUMMARY|(NORTHERN LIGHTS)|NONE)"))

# MONTHLY PRECIPITATION has to be divided into two event types
# based on carefully selected keywords within the REMARKS column, because
# it can be either too much or too little precipitation (very different)
std_data_mprecip <- main_data %>%
  filter(EVTYPE %in% c("MONTHLY RAINFALL", "MONTHLY PRECIPITATION")) %>%
  mutate(REMARKS = REMARKS %>% str_to_lower,
         EVTYPE = if_else(str_detect(
             REMARKS,
             "(only|(less than an inch)|driest|(an inch or less))"),
           "DRYNESS & DROUGHT", "EXCESSIVE/UNUSUAL RAIN")
  )

# OTHER is a very ambiguous event type, so it can be divided into
# many different event types depending on the content of the REMARKS
std_data_other <- main_data %>% 
  filter(EVTYPE == "OTHER") %>% 
  mutate(REMARKS = REMARKS %>% str_to_lower,
         EVTYPE = case_when(
          str_detect(REMARKS, "dust( )?devil") ~ "DUST DEVIL",
          str_detect(REMARKS, "blizzard") ~ "BLIZZARD",
          str_detect(REMARKS, "excess rainfall") ~ "EXCESSIVE/UNUSUAL RAIN",
          str_detect(REMARKS, "lightning-caused fires") ~ "LIGHTNING",
          str_detect(REMARKS, "thunderstorm") ~ "THUNDERSTORM WIND",
          str_detect(REMARKS, "(wind|squall|gust)") ~ "WIND GUST",
          str_detect(REMARKS, "rogue wave") ~ "HIGH SURF",
          str_detect(REMARKS, "maximum temperature") ~ "EXCESSIVE/UNUSUAL HEAT"
      )
  )

# Compile them together into our standardized data
std_data <- bind_rows(std_data_main, std_data_mprecip, std_data_other) %>% 
  arrange(REFNUM)

# Print some info about the standardization process
num_of_events <- std_data %>% pull(EVTYPE) %>% unique %>% length

print(str_glue("Number of rows (after standardization): ", nrow(std_data)))
## Number of rows (after standardization): 622442
print(str_glue("Number of unique events (after standardization): ", num_of_events))
## Number of unique events (after standardization): 43

After grouping and merging, there are 43 event types left (remember, full details on how the decisions were made are available on Appendix B), and there are six less rows after the process. The missing six rows are rows with EVTYPE containing the words “SUMMARY”, “NORTHERN LIGHTS”, and “NONE”, which denote a summary of events already recorded in other event types or events that are not actually severe weather events.

6. Finalize clean data

Our data processing is almost done, and the last thing we need to do is discarding several columns that have been used for processing and no longer needed to produce the main analysis later (BGN_DATE, YEAR, REMARKS, REFNUM).

clean_data <- std_data %>% select(-(YEAR:BGN_DATE), -(REMARKS:REFNUM))

head(clean_data) %>% kable
EVTYPE FATALITIES INJURIES PROPDMG CROPDMG POPHEALTHDMG ECONDMG
LIGHTNING 0 0 50000 0 0 50000
THUNDERSTORM WIND 0 0 1500 0 0 1500
FLOOD/HIGH WATER 0 0 8000 0 0 8000
THUNDERSTORM WIND 0 0 15000 2000 0 17000
THUNDERSTORM WIND 0 0 20000 2000 0 22000
TORNADO 0 0 80000 2000 0 82000

Results

As a reminder, below are the two questions we need to answer in this report:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

In order to answer these questions, two visualizations will be produced showing the top 10 event types that contribute the damages required in the objectives. The visualizations are deemed to be sufficient to answer the objectives because the top 10 contributes ~90% and ~96% of the total population health and economic damages, respectively. By putting emphasis on the events with the most impact, the results will also help answer the other objective this report is meant to help with: prioritizing resources that will be used to prepare for severe weather events.

The analysis and visualizations will be provided in the next sections. Note that no specific recommendation is required for the report.

Population Health

In terms of population health, there are two possible variables in NOAA Storm Events Database that are relevant to the topic: fatalities and injuries. To have a single metric that captures both, fatalities and injuries are added together into a single variable named POPHEALTHDMG. This is the metric displayed in the visualization.

pophealthdmg <- clean_data %>%
  group_by(EVTYPE) %>% summarize(SUM_DMG = sum(POPHEALTHDMG), 
                                 AVG_DMG = mean(POPHEALTHDMG) %>% round(2),
                                 LABEL = str_glue(label_comma()(SUM_DMG), 
                                                  " victims (", AVG_DMG, " avg)")
                                 ) %>%
  arrange(-SUM_DMG) %>% slice_head(n = 10)

pophealth_plot <- ggplot(pophealthdmg, aes(reorder(EVTYPE, SUM_DMG), SUM_DMG, fill = SUM_DMG)) +
  geom_col(show.legend = FALSE) + 
  geom_text(aes(label = if_else(SUM_DMG > 10000, LABEL, NA)), color = "white", hjust = 1.1, size = 3, na.rm = TRUE) + 
  geom_label(aes(label = if_else(SUM_DMG <= 10000, LABEL, NA)), size = 3, fill = "#E2E3EF", colour = "black", label.size = NA, hjust = -0.1, na.rm = TRUE) + 
  labs(title = "Most Harmful Weather Events to Population Health", 
       subtitle = "Top 10 Weather Types, Damages Between 1996-2011", 
       x = NULL, y = "Total Fatalities & Injuries",
       caption = "Data from NOAA Storm Database (Dec '96–Nov '11). Avg is calculated per event.") +
  coord_flip() + 
  scale_y_continuous(labels = label_comma()) + 
  scale_fill_gradient(high = "#002142", low = "#005FA3")

pophealth_plot

As mentioned before, the top 10 events contribute around ~90% of all damages caused by weather events between December 1996 to November 2011. From the figure, we can see that Tornado is the event type with the most fatalities and injuries by a large margin, while the event type with the highest average victim per event is Hurricane/Typhoon. As a result, although it is true that it would be more useful to prepare for events with large damage total like Tornado, local governments should be especially cautious when events like Hurricane/Typhoon with high average happen, as it may claim more victims in one incident once it happens.

Provided below is the complete tally for the top 10 events. Note how there are overwhelmingly more injuries than fatalities in all events, except for Rip Current. It is very difficult to escape a rip current once it has caught a person, which may explain the fatality number.

clean_data %>%
  group_by(EVTYPE) %>% summarize(n = n(),
                                 Injuries = sum(INJURIES),
                                 Fatalities = sum(FATALITIES),
                                 Total = sum(POPHEALTHDMG), 
                                 Average = mean(POPHEALTHDMG) %>% round(2)) %>%
  arrange(-Total) %>% slice_head(n = 10) %>% mutate(`#` = row_number(), .before = EVTYPE) %>%
  kable
# EVTYPE n Injuries Fatalities Total Average
1 TORNADO 21933 19963 1486 21449 0.98
2 FLOOD/HIGH WATER 74507 8418 1206 9624 0.13
3 EXCESSIVE/UNUSUAL HEAT 2696 7573 2001 9574 3.55
4 THUNDERSTORM WIND 201408 4821 359 5180 0.03
5 LIGHTNING 12300 3828 599 4427 0.36
6 WIND GUST 23163 1397 340 1737 0.07
7 ICE/WINTER STORM 12658 1299 243 1542 0.12
8 WILDFIRE 4118 1425 87 1512 0.37
9 HURRICANE/TYPHOON 228 1309 89 1398 6.13
10 RIP CURRENT 703 438 517 955 1.36

Economic Consequences

In terms of economic consequences, there are two possible variables in NOAA Storm Events Database that are relevant to the topic: property and crop damages. To have a single metric that captures both, property and crop damages are added together into a single variable named ECONDMG. This is the metric displayed in the visualization.

label_dollar_abbrev <- label_currency(scale_cut = cut_short_scale())

econdmg <- clean_data %>%
  group_by(EVTYPE) %>% summarize(SUM_DMG = sum(ECONDMG), 
                                 AVG_DMG = mean(ECONDMG) %>% round(2),
                                 LABEL = str_glue(label_dollar_abbrev(SUM_DMG), 
                                                  " damages (", label_dollar_abbrev(AVG_DMG), 
                                                  " avg)")
                                 ) %>%
  arrange(-SUM_DMG) %>% slice_head(n = 10)

econ_plot <- ggplot(econdmg, aes(reorder(EVTYPE, SUM_DMG), SUM_DMG, fill = SUM_DMG)) +
  geom_col(show.legend = FALSE) + 
  geom_text(aes(label = if_else(SUM_DMG > 10 ^ 11, LABEL, NA)), color = "white", hjust = 1.1, size = 3, na.rm = TRUE) + 
  geom_label(aes(label = if_else(SUM_DMG <= 10 ^ 11, LABEL, NA)), size = 3, fill = "#E2E3EF", colour = "black", label.size = NA, hjust = -0.1, na.rm = TRUE) + 
  labs(title = "Weather Events with the Most Economic Consequences", 
       subtitle = "Top 10 Weather Types, Damages Between 1996-2011", 
       x = NULL, y = "Total Economic Damages (Property + Crop)",
       caption = "Data from NOAA Storm Database (Dec '96–Nov '11). Avg is calculated per event.") +
  coord_flip() + 
  scale_y_continuous(labels = label_dollar_abbrev) + 
  scale_fill_gradient(high = "#002142", low = "#005FA3")

econ_plot

As mentioned before, the top 10 events contribute around ~96% of all economic damages caused by weather events between December 1996 to November 2011, which is a larger proportion than the case with population health. Flood/High Water is the event type contributing the most damage at $163B, followed by Hurricane/Typhoon at $85B, and it is worth noting that Hurricane/Typhoon has the highest average damage at $374M too, followed by Storm Surge/Tide at $126M damages.

The complete tally will be given below, and we can see from the tally that Flood/High Water is a relatively “common” event compared to Hurricane/Typhoon (or Storm Surge/Tide), which likely helps explain the disparity in average damage that we see here. After all, Flood/High Water has many possible causes and includes many kinds of events, from large-scale flash flood to urban & small stream flood, which lets it have comparatively low average compared to Hurricane/Typhoon. In fact, the tally seems to suggest that rare top 10 events that have less than 1K incidents possess exceptionally high average damage when they occur.

clean_data %>%
  group_by(EVTYPE) %>% summarize(n = n(),
                                 Property = sum(PROPDMG),
                                 Crop = sum(CROPDMG),
                                 Total = sum(ECONDMG), 
                                 Average = mean(ECONDMG) %>% round(2)) %>%
  arrange(-Total) %>% slice_head(n = 10) %>% ungroup %>%
  mutate(`#` = row_number(), .before = EVTYPE) %>% 
  mutate(across(Property:Total, label_dollar_abbrev), Average = label_currency()(Average)) %>%
  kable
# EVTYPE n Property Crop Total Average
1 FLOOD/HIGH WATER 74507 $157.249B $5.93B $163.18B $2,190,158
2 HURRICANE/TYPHOON 228 $80.337B $5.00B $85.34B $374,284,920
3 STORM SURGE/TIDE 381 $47.827B $850K $47.83B $125,532,774
4 TORNADO 21933 $23.898B $271M $24.17B $1,101,959
5 HAIL 197111 $13.982B $2.31B $16.29B $82,669
6 DRYNESS & DROUGHT 2502 $919M $12.88B $13.80B $5,514,463
7 THUNDERSTORM WIND 201408 $7.508B $976M $8.48B $42,124
8 TROPICAL STORM 658 $7.596B $677M $8.27B $12,572,776
9 WILDFIRE 4118 $7.687B $399M $8.09B $1,963,509
10 WIND GUST 23163 $5.317B $592M $5.91B $255,120

Conclusion

Visualizations have been created to answer the two questions presented at the beginning of the report.

pophealth_plot1 <- pophealth_plot + labs(title = "Population Health Harm", subtitle = NULL, caption = NULL)
econ_plot1 <- econ_plot + labs(title = "Economic Damages", subtitle = NULL, caption = NULL)

pophealth_plot1 + econ_plot1 + 
  plot_annotation(title = "Top 10 Severe Weather Events by Damage Type", 
                  subtitle = "December 1996–November 2011",
                  caption = "Data from NOAA Storm Database.")

Based on the analysis, to reduce or mitigate damages caused by severe weather events, the government can prioritize their resources for the top 10 events displayed by the respective population health and economic damage visualizations. Because the top 10 events alone accounted for more than 90% of total damages, they are a good pick for priority given limited resources. The combined visualizations can be seen above.

Appendix

Appendix A: Possible Interpretation of -EXP Column Values

According to the given documentation (pg. 12), damage magnitude should be denoted with “K”, “M”, or “B”. Other than that, we might expect empty strings or other similar ways to denote that a magnitude indicator is not needed, when the damage value is too small to be given a magnitude. However, there are other mysterious characters present in data between 1993-1995, displayed below:

exp_chars <- data %>% filter(between(BGN_DATE %>% mdy_hms %>% year, 1993, 1995)) %$% c(PROPDMGEXP, CROPDMGEXP)
exp_chars[!str_to_upper(exp_chars) %in% c("B", "K", "M", "")] %>% table
## .
##   -   ?   +   0   1   2   3   4   5   6   7   8   h   H 
##   1  15   5 234  25  14   4   4  28   4   5   1   1   6

From all the characters we can see, “-” and “h”/“H” might be possible to explain as indicator for empty values or hundreds, respectively, and 0 could be something similar with “-”, but the rest is much more difficult to puzzle out. Suppose there is a PROPDMG of 12, if the PROPDMGEXP value is 2, does it mean the property damage is 12^2? Or 12 * 10^2? Or perhaps it is something else entirely?

There is something interesting in the 1993 annual Storm Data Publication (can be obtained here) that could hint what an integer with a single digit might mean for damage magnitude. A picture is not attached here (in case there will be someone who sees it and consider it a new figure exceeding the 3-figure limit), but more or less it reads as follows:

REFERENCE NOTES
STORM DATA CATAGORIES (sic)
1 Less than $50
2 $50 to $500

8 $50 Million to $500 Million
9 $500 Million to $5 Billion

Is that what the magnitude numbers mean? On the other hand, the dataset version that has to be used in this report is an old version, and the newest version currently hosted at NOAA’s website no longer has the -EXP columns, just the damage numbers directly that are easier to interpret. It is not easy to look up the current figures for each event with these magnitude numbers in the database, but one of the results seems to indicate that these numbers means 0; that is, instead of 12 * 10^2, the damage should be 12 * 10^0.

In conclusion, it is uncertain and ambiguous what these numbers in -EXP columns actually mean. It could be something large like $500 millions in the 1993 annual Storm Data Publication, or it could be as small as 0. The best way to deal with this would be downloading the newest version of the database which has the more accurate numbers, but we are currently stuck using this old version for now. Most of our data cleaning problems here probably do not exist anymore in the newest version of the dataset, yet it might feel a bit unfair if we skip all the problems while other people still have to deal with using the old version…

Appendix B: Merging & Grouping EVTYPE

From 323 unique event types, the event types are merged together until only 43 event types are left. While events are combined in a manner as close as possible to NOAA’s 48 event types, there are several event types in EVTYPE that are impossible to classify cleanly into one of NOAA’s event types due to ambiguity. This ambiguity may occur because one EVTYPE actually includes two or more standard event types at the same time, or it can also occur because remarks that may help classification also do not exist within that event type. Both these cases, where the classifications are done differently due to circumstances in the dataset, are documented below:

No EVTYPE Description
1 Cold/Wind Chill In NOAA’s event types, Cold/Wind Chill is actually distinguished into Cold/Wind Chill and Extreme Cold/Wind Chill. However, there are many non-standard event types (ex: “Cold”, “Cold Temperatures”) that do not identify whether it is just the normal Cold/Wind Chill or it is the Extreme type, so the two variants are merged in order to accomodate those non-standard event types.
2 Excessive/Unusual Heat The case is similar with Cold/Wind Chill, while the definition is also broadened with Unusual Heat since the dataset also contains those particular cases. Unusual heat may refer to situations where the temperature by itself is not very high, but it happens at an unusual time (ex: outside the expected months).
3 Excessive/Unusual Rain Originally, it is named as Heavy Rain in NOAA’s classification, but the definition (or at least the wording) is broadened to include unusual rain for similar reasons with Excessive/Unusual Heat.
4 Dryness & Drought Originally, it is named as Drought in NOAA’s classification. However, there are some non-standard event types that do not record outright drought, but simply dry conditions, so the definition is broadened here.
5 Flood/High Water Flood/High Water is a combination of Flood, Flash Flood, and a non-standard category called High Water. Flood and Flash Flood are combined due to similar reasons with Cold/Wind Chill, while High Water is a separate category where an outright flood does not occur, but the water (usually within a creek) is high enough to result in casualties.
6 Coastal Flood & Erosion Originally, it is only named as Coastal Flood in NOAA’s classification. However, the definition is broadened because there are non-standard event types dealing with beach erosion (ex: “Coastal Flooding/Erosion), and both may happen at the same time.
7 Rough/Heavy Wave The closest existing type to this would be High Surf, but High Surf is defined as “large waves on or near shore”, while this type is created to accommodate non-standard event types where the waves may not necessarily happen close to shores.
8 Frost/Freeze/Glaze Originally, it is only named as Frost/Freeze in NOAA’s classification. The definition is slightly broadened by adding Glaze to make it more descriptive after integrating similar non-standard event types (ex: “Black Ice”).
9 Smoke Renamed from Dense Smoke to just Smoke, because the non-standard event type that deals with smoke, named Smoke, may not always specify the density of the smokes.
10 Fog The case is similar with Cold/Wind Chill, except the original types are Dense Fog and Freezing Fog.
11 Winter Weather There is an existing event type named Winter Weather, but the content is broadened in this analysis to, at the very least, include Sleet, which used to be separate. Before 2004, there used to be a non-standard event type called “Mixed Precip(itation)”, which has some overlap with Winter Weather but includes Sleet in it, leading to Winter Weather being broadened to include Sleet for the purpose of accommodating Mixed Precipitation. The original definition of Winter Weather includes “snow, … , or freezing rain/drizzle”, and sleet seems to be in somewhat a spectrum between them here or within Mixed Precipitation remarks.
12 Snow Originally, it is named as Heavy Snow in NOAA’s classification. The case is similar with Dryness & Drought or Smoke (having many non-standard event types not specifying the intensity of the situation), except snow-related non-standard event types are even more diverse than those two cases.
13 Ice/Winter Storm Originally, there are Ice Storm and Winter Storm event types. However, the non-standard event types “Ice” and “Ice/Snow” are actually filled with many examples of a full-fledged ice/winter storm, so the event types are combined to accommodate those cases. Winter precipitation cases with at least one type exceeding warning criteria may be classified into this event type.
14 Wind Gust, Marine Wind Gust Originally, there are at least two variants for Wind event types: Strong or High Wind. However, there are many non-standard event types that do not distinguish between both (ex: “Wind”), and they are numerous enough that the two event types have to be merged to accommodate them.

The ones described above are the major modifications from NOAA’s 48 event types. Other than that, regular expressions are used to translate names of non-standard event types to standard event types. Regular expressions may be necessary due to typos, name variations, a non-standard event type being a subset of a standard event type (ex: dry/wet microburst to thunderstorm wind) or complex patterns are required to avoid mistakes in classifying. These patterns are formulated by carefully researching the event names, reading the remarks associated with the events with their definitions in the documentation, and cross-checking with the newer version of the dataset on NOAA’s website when necessary (ex: “Saharan Dust”).


  1. https://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype↩︎

  2. https://www.ncdc.noaa.gov/stormevents/details.jsp?type=collection↩︎