In order to identify severe weather events in the United States to be prioritized for resource allocation and damage mitigation, a data analysis is conducted using the Storm Events Database produced by the U.S. National Oceanic and Atmospheric Administration (NOAA). The dataset is filtered, cleaned, and standardized into 43 unique weather event types, and the damages caused by each event type are summarized to draw a picture of the event’s total population health and economic damages across the United States. The top 10 events for both types of damages are identified, accounting for more than 90% of all damages caused by severe weather events. As a result, the government can prioritize efforts for damage prevention on these event types to reduce fatalities, injuries, and economic consequences of severe weather events.
Given the U.S. NOAA Storm Events Database provided here, there are two key questions we seek to answer:
Across the United States, which types of events (as indicated in
the EVTYPE variable) are most harmful with respect to
population health?
Across the United States, which types of events have the greatest economic consequences?
First, we will import the necessary R packages for this analysis. Then, each data processing step will be marked with numbers, such as 1, 2, 3, and so on.
# For data processing
library(magrittr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(stringr, warn.conflicts = FALSE)
# For data visualization
library(ggplot2, warn.conflicts = FALSE)
library(scales, warn.conflicts = FALSE)
library(patchwork, warn.conflicts = FALSE)
# For prettifying tables
library(knitr, warn.conflicts = FALSE)
We begin this analysis by importing NOAA Storm Events Database, which contains:
A. Records of significant weather phenomena/events since January 1950.
B. Associated damages (fatalities/injuries/economic) attributed to these phenomena/events.
More information about the database can be found here, although the version we use here is older than what is currently hosted by NOAA. The version utilized here includes records up to November 2011.
# Download file, if the file is not found locally
if (!file.exists("StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./StormData.csv.bz2", method = "curl")
}
# Read data, and display a few info about it
data <- read.csv("StormData.csv.bz2")
cat("Finished importing StormData.csv.bz2!", sep="\n")
## Finished importing StormData.csv.bz2!
print(str_glue("Total rows: ", nrow(data)))
## Total rows: 902297
print(str_glue("Total columns: ", ncol(data)))
## Total columns: 37
We will reduce the number of columns in the dataset to retain only the information we will be using in this analysis and allow better focus during the course of analysis.
After carefully checking the available columns, only these columns will be retained:
A. EVTYPE as the types of events being recorded.
B. FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, and
CROPDMGEXP as columns related to damages.
In addition, BGN_DATE, REMARKS, and
REFNUM will be retained temporarily due to their usage in
more data transformation later. To be exact, BGN_DATE is
picked over END_DATE because all events have
BGN_DATE, but not all are guaranteed to have
END_DATE. These columns will be discarded after data
processing associated with them is finished.
main_data <- data %>% select(BGN_DATE, EVTYPE, FATALITIES:CROPDMGEXP, REMARKS, REFNUM)
print(str_glue("Number of columns left: ", ncol(main_data)))
## Number of columns left: 10
Next, we will transform the columns to make further processing and analysis easier:
A. All event types (EVTYPE) will be converted to upper
case with extra spaces removed for consistency.
B. BGN_DATE will be parsed using lubridate
to convert the values into proper date-time objects.
C. A YEAR column will be added from
BGN_DATE.
main_data <- main_data %>%
mutate(EVTYPE = EVTYPE %>% str_squish %>% str_to_upper,
BGN_DATE = mdy_hms(BGN_DATE), YEAR = year(BGN_DATE)) %>%
relocate(YEAR, .before = BGN_DATE)
head(main_data) %>% kable
| YEAR | BGN_DATE | EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP | REMARKS | REFNUM |
|---|---|---|---|---|---|---|---|---|---|---|
| 1950 | 1950-04-18 | TORNADO | 0 | 15 | 25.0 | K | 0 | 1 | ||
| 1950 | 1950-04-18 | TORNADO | 0 | 0 | 2.5 | K | 0 | 2 | ||
| 1951 | 1951-02-20 | TORNADO | 0 | 2 | 25.0 | K | 0 | 3 | ||
| 1951 | 1951-06-08 | TORNADO | 0 | 2 | 2.5 | K | 0 | 4 | ||
| 1951 | 1951-11-15 | TORNADO | 0 | 2 | 2.5 | K | 0 | 5 | ||
| 1951 | 1951-11-15 | TORNADO | 0 | 6 | 2.5 | K | 0 | 6 |
In terms of rows, we will reduce the number of rows for the sake of better accuracy and fairness in comparison. Read on below to find out the reasons and keep in mind that more recent years are considered more complete in the instruction. Now, after checking the documentations on NOAA’s website1 2 and cross-validating with the dataset, it seems that:
A. Up to 1992, the database only recorded three types of events (Tornado, Thunderstorm Wind, and Hail), while other types of events are not recorded, which might explains why the database is still named Storm Events Database to this day. As a result, we cannot use data from before 1993 if we intend to keep a fair comparison between all types of events that we know and are being recorded today (officially, around ~48 event types in the documentation), because including data before 1993 means all other events will be underrepresented in the analysis, affecting the results.
num_of_events <- main_data %>% filter(YEAR <= 1992) %>%
group_by(YEAR) %>% summarize(n = length(unique(EVTYPE)))
print(str_glue("Number of unique event types (<= 1992): ", num_of_events %>% pull(n) %>% max))
## Number of unique event types (<= 1992): 3
B. From 1993-1995, the records were extracted from unformatted text files, and a best effort attempt was done to extract information from the files. As noted in the documentation, there are many typos and inconsistencies in how the event types were recorded (potential sources of inaccuracies), and data from June-July 1993 is missing.
In addition, these are the years where there are unexpected
characters within -EXP columns recording the
“exponential”/magnitude of economic damages, where the real meanings of
these characters are largely unknown (see Appendix A for potential
interpretation, none of them certain), which can potentially introduce
large distortions in the results if it turns out the true values are
substantial.
exp_chars <- main_data %>% filter(between(YEAR, 1993, 1995)) %$% c(PROPDMGEXP, CROPDMGEXP)
exp_chars[!str_to_upper(exp_chars) %in% c("B", "K", "M", "")] %>% table
## .
## - ? + 0 1 2 3 4 5 6 7 8 h H
## 1 15 5 234 25 14 4 4 28 4 5 1 1 6
Many of these problems are corrected by NOAA in 2013, but the
corrections are not present in the version we use. As a result, for the
sake of better accuracy (because both -EXP and
EVTYPE seem to have noteworthy uncertainties during this
time period), we will exclude these years and base our analysis on the
records made after database software was used to store the data (>=
1996).
C. If we want the most accurate event classification, the 48 event types as we know them today exist after 2004 (there are still non-standard classification such as “PROLONG COLD” in 2003), which means data after 2004 is likely to be in the best shape to be analyzed within this dataset.
main_data %>% filter(YEAR >= 1996) %>% distinct(YEAR, EVTYPE) %>% pull(YEAR) %>% table
## .
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 199 135 122 118 110 120 99 51 38 46 50 46 46 46 46 46
However, we will only have 8 years of data if we only use data after 2004 (that is, 2004-2011). Because more data is better in this kind of analysis (at least, it would be preferable if we have at least a decade of data), we will not exclude more data due to event type reasons beyond the 3 years (1992-1995) we have excluded.
Finally, we can see that our latest data is November 2011. Because severe weather events likely have major seasonal influences in nature, it is important to match the months we use to avoid giving undue weights to certain stretch of the year in our calculation and comparison. As a result, we will use data between December 1996-November 2011 in order to represent each month exactly 15 times, ensuring that we do not give more weights or attention to particular months/seasons. Thus, our final data will be 15 years in length, which is more than the decade we desired, and we will filter the rows below.
main_data <- main_data %>% filter(BGN_DATE > "1996-12-01") %>% arrange(BGN_DATE)
# When displaying the data, we'll cut-off REMARKS to keep our previews short
head(main_data) %>% mutate(REMARKS = str_trunc(REMARKS, 30)) %>% kable
| YEAR | BGN_DATE | EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP | REMARKS | REFNUM |
|---|---|---|---|---|---|---|---|---|---|---|
| 1996 | 1996-12-01 | LIGHTNING | 0 | 0 | 50.0 | K | 0 | Lightning struck a drug sto… | 249360 | |
| 1996 | 1996-12-01 | TSTM WIND | 0 | 0 | 1.5 | K | 0 | Several trees were blown do… | 249361 | |
| 1996 | 1996-12-01 | FLASH FLOOD | 0 | 0 | 8.0 | K | 0 | K | Streets were reported flood… | 249362 |
| 1996 | 1996-12-01 | TSTM WIND | 0 | 0 | 15.0 | K | 2 | K | A storage building lost a r… | 249363 |
| 1996 | 1996-12-01 | TSTM WIND | 0 | 0 | 20.0 | K | 2 | K | Roof damage to houses along… | 249364 |
| 1996 | 1996-12-01 | TORNADO | 0 | 0 | 80.0 | K | 2 | K | A small tornado touched dow… | 249365 |
We will calculate the four types of damages recorded by NOAA’s Storm Events Database (fatalities, injuries, property damages, and crop damages), and then combine them into two types of damages according to the objectives of this report: population health and economic damages.
Here is a breakdown of the necessary transformation:
A. FATALITIES and INJURIES need no further
calculation, but they will be added together into
POPHEALTHDMG.
B. PROPDMG and CROPDMG need to be
multiplied with the appropriate magnitude using PROPDMGEXP
and CROPDMGEXP, respectively, and added together into
ECONDMG. The -EXP columns are translated after
cross-checking with the documentation and examining the data, then they
will be discarded when no longer needed.
dmg_exp_conv <- tibble(DMGEXP = c("", "0", "K", "M", "B"),
DMGEXPCONV = c(1, 1, 10 ^ 3, 10 ^ 6, 10 ^ 9))
main_data <- main_data %>%
left_join(dmg_exp_conv, by = join_by(PROPDMGEXP == DMGEXP)) %>%
mutate(PROPDMG = PROPDMG * DMGEXPCONV) %>%
select(-DMGEXPCONV, -PROPDMGEXP) %>%
left_join(dmg_exp_conv, by = join_by(CROPDMGEXP == DMGEXP)) %>%
mutate(CROPDMG = CROPDMG * DMGEXPCONV) %>%
select(-DMGEXPCONV, -CROPDMGEXP) %>%
mutate(POPHEALTHDMG = FATALITIES + INJURIES,
ECONDMG = PROPDMG + CROPDMG)
head(main_data %>% select(EVTYPE:CROPDMG, POPHEALTHDMG:ECONDMG)) %>% kable
| EVTYPE | FATALITIES | INJURIES | PROPDMG | CROPDMG | POPHEALTHDMG | ECONDMG |
|---|---|---|---|---|---|---|
| LIGHTNING | 0 | 0 | 50000 | 0 | 0 | 50000 |
| TSTM WIND | 0 | 0 | 1500 | 0 | 0 | 1500 |
| FLASH FLOOD | 0 | 0 | 8000 | 0 | 0 | 8000 |
| TSTM WIND | 0 | 0 | 15000 | 2000 | 0 | 17000 |
| TSTM WIND | 0 | 0 | 20000 | 2000 | 0 | 22000 |
| TORNADO | 0 | 0 | 80000 | 2000 | 0 | 82000 |
At last, we will standardize the event types that will be used in the
analysis. EVTYPE is the last column that needs specific
transformation before analysis can be initiated.
num_uniq_types <- main_data %>% pull(EVTYPE) %>% unique %>% length
print(str_glue("Number of unique event types: ", num_uniq_types))
## Number of unique event types: 323
There are 323 event types in the data, and most of them overlap with each other, making it difficult to produce an accurate analysis without grouping/standardizing them first. If the event type grouping already follows the 48 event types used in the given NOAA documentation, overlap will not be a concern, but we have found out earlier that this is only the case for year >= 2004. Thus, event types will need to be merged together and the overlaps sorted out before we have the event types we can use in the analysis later.
An important note: Due to ambiguity or some non-standardized event types having larger scopes than the ideal 48 event types, the final number of event types might not be 48 or be the same event types as defined in the documentation. Details about groupings will be discussed in Appendix B.
Event types are combined together using regular expression patterns written after studying the available event types, the included remarks describing the events, and the common definitions of the weather terms. The code is provided below, but keep in mind that it can be quite… long, due to its degree of specificity.
print(str_glue("Number of rows (before standardization): ", nrow(main_data)))
## Number of rows (before standardization): 622448
# This variable holds all cleaned data except for three event types
# which require slightly different methods to clean
# and will be cleaned separately later for convenience
std_data_main <- main_data %>%
filter(!EVTYPE %in% c("MONTHLY RAINFALL", "MONTHLY PRECIPITATION", "OTHER")) %>%
mutate(EVTYPE = case_when(
str_detect(EVTYPE, "MARINE TSTM") ~ "MARINE THUNDERSTORM WIND",
str_detect(EVTYPE, "(?<!MARINE )(?<!NON( |\\-))(TSTM|THUNDERSTORM|MICROBURST|(THUNDERSNOW)|(WALL CLOUD))") ~ "THUNDERSTORM WIND",
str_detect(EVTYPE, "(FIRE|(RED FLAG))") ~ "WILDFIRE",
str_detect(EVTYPE, "(?<!MARINE )HAIL") ~ "HAIL",
str_detect(EVTYPE, "((CHILL|(((EXTREME|EXCESSIVE|UNSEASONABLY|UNUSUALLY|RECORD|PROLONG) )?COLD( (WEATHER|TEMPERATURES))?))|(LOW TEMP)|COOL|HYPOTHERMIA|(MONTHLY TEMPERATURE))(?! AND)") ~ "COLD/WIND CHILL",
str_detect(EVTYPE, "(HEAT|WARM|HOT|HYPERTHERMIA|(RECORD HIGH)|(TEMPERATURE RECORD)|(RECORD TEMPERATURE))") ~ "EXCESSIVE/UNUSUAL HEAT",
str_detect(EVTYPE, "(((HEAVY|EXCESSIVE|PROLONGED|UNSEASONAL|RECORD|EARLY) ((RAIN(FALL)?)|PRECIPITATION))|(RAIN .HEAVY.)|WET|DROWNING)") ~ "EXCESSIVE/UNUSUAL RAIN",
str_detect(EVTYPE, "(DROUGHT|DRY|DRIEST|(RECORD LOW))") ~ "DRYNESS & DROUGHT",
str_detect(EVTYPE, "(SURF|(HIGH SWELLS))") ~ "HIGH SURF",
str_detect(EVTYPE, "SLEET") ~ "SLEET",
str_detect(EVTYPE, "(?<!(CSTL|COASTAL|LAKESHORE|TIDAL) ?)(FLOOD|FLD|(HIGH WATER)|(DAM BREAK))") ~ "FLOOD/HIGH WATER",
str_detect(EVTYPE, "((CSTL|COASTAL|TIDAL|BEACH) ?(FLOOD|FLD|EROSION)|(ASTRONOMICAL HIGH TIDE))") ~ "COASTAL FLOOD & EROSION",
str_detect(EVTYPE, "VOLCANIC") ~ "VOLCANIC ASH",
str_detect(EVTYPE, "(HURRICANE|TYPHOON|FLOYD)") ~ "HURRICANE/TYPHOON",
str_detect(EVTYPE, "(ROCK|MUD|LAND) ?(SLIDE|SLUMP)") ~ "DEBRIS FLOW",
str_detect(EVTYPE, "LAKE(-| )EFFECT SNOW") ~ "LAKE-EFFECT SNOW",
str_detect(EVTYPE, "RIP CURRENT") ~ "RIP CURRENT",
str_detect(EVTYPE, "( SEA|WAVE|(MARINE ACCIDENT))(?!SON)") ~ "ROUGH/HEAVY WAVE",
str_detect(EVTYPE, "(TORNADO|LANDSPOUT)") ~ "TORNADO",
str_detect(EVTYPE, "(FREEZE|FROST|(IC[EY].*ROAD)|(PATCHY ICE)|(BLACK ICE)|GLAZE)") ~ "FROST/FREEZE/GLAZE",
str_detect(EVTYPE, "COASTAL( )?STORM") ~ "TROPICAL STORM",
str_detect(EVTYPE, "STORM SURGE") ~ "STORM SURGE/TIDE",
str_detect(EVTYPE, "SMOKE") ~ "SMOKE",
str_detect(EVTYPE, "[FV]OG") ~ "FOG",
str_detect(EVTYPE, "((FREEZING (RAIN|DRIZZLE|PRECIP))|(MIXED PRECIP)|((WINTER|WINTRY|WINTERY)( WEATHER)?( (MIX|WEATHER)))|(SLEET)|(RAIN/SNOW)|(ICE PELLETS))(?! STORM)") ~ "WINTER WEATHER",
str_detect(EVTYPE, "(?<!ICE/)SNOW") ~ "SNOW",
str_detect(EVTYPE, "BLIZZARD") ~ "BLIZZARD",
str_detect(EVTYPE, "((ICE STORM)|(SLEET STORM)|(WINTER STORM)|(ICE/SNOW))") ~ "ICE/WINTER STORM",
str_detect(EVTYPE, "^ICE$") ~ "ICE/WINTER STORM",
str_detect(EVTYPE, "WATERSPOUT") ~ "WATERSPOUT",
str_detect(EVTYPE, "(?<!MARINE |LAKE )(STRONG|HIGH|GUSTY|LOW|GRADIENT|(NON.*)) WIND") ~ "WIND GUST",
str_detect(EVTYPE, "^(WIND|WND).*$") ~ "WIND GUST",
str_detect(EVTYPE, "((MARINE (STRONG|HIGH))|(GUSTY LAKE))") ~ "MARINE WIND GUST",
str_detect(EVTYPE, "DUST DEV[IE]L") ~ "DUST DEVIL",
str_detect(EVTYPE, "((BLOWING DUST)|(SAHARAN DUST))") ~ "DUST STORM",
str_detect(EVTYPE, "((FUNNEL CLOUD)|WHIRLWIND)") ~ "FUNNEL CLOUD",
str_detect(EVTYPE, "BLOW-OUT TIDE") ~ "ASTRONOMICAL LOW TIDE",
.default = EVTYPE
)) %>%
filter(!str_detect(EVTYPE, "(SUMMARY|(NORTHERN LIGHTS)|NONE)"))
# MONTHLY PRECIPITATION has to be divided into two event types
# based on carefully selected keywords within the REMARKS column, because
# it can be either too much or too little precipitation (very different)
std_data_mprecip <- main_data %>%
filter(EVTYPE %in% c("MONTHLY RAINFALL", "MONTHLY PRECIPITATION")) %>%
mutate(REMARKS = REMARKS %>% str_to_lower,
EVTYPE = if_else(str_detect(
REMARKS,
"(only|(less than an inch)|driest|(an inch or less))"),
"DRYNESS & DROUGHT", "EXCESSIVE/UNUSUAL RAIN")
)
# OTHER is a very ambiguous event type, so it can be divided into
# many different event types depending on the content of the REMARKS
std_data_other <- main_data %>%
filter(EVTYPE == "OTHER") %>%
mutate(REMARKS = REMARKS %>% str_to_lower,
EVTYPE = case_when(
str_detect(REMARKS, "dust( )?devil") ~ "DUST DEVIL",
str_detect(REMARKS, "blizzard") ~ "BLIZZARD",
str_detect(REMARKS, "excess rainfall") ~ "EXCESSIVE/UNUSUAL RAIN",
str_detect(REMARKS, "lightning-caused fires") ~ "LIGHTNING",
str_detect(REMARKS, "thunderstorm") ~ "THUNDERSTORM WIND",
str_detect(REMARKS, "(wind|squall|gust)") ~ "WIND GUST",
str_detect(REMARKS, "rogue wave") ~ "HIGH SURF",
str_detect(REMARKS, "maximum temperature") ~ "EXCESSIVE/UNUSUAL HEAT"
)
)
# Compile them together into our standardized data
std_data <- bind_rows(std_data_main, std_data_mprecip, std_data_other) %>%
arrange(REFNUM)
# Print some info about the standardization process
num_of_events <- std_data %>% pull(EVTYPE) %>% unique %>% length
print(str_glue("Number of rows (after standardization): ", nrow(std_data)))
## Number of rows (after standardization): 622442
print(str_glue("Number of unique events (after standardization): ", num_of_events))
## Number of unique events (after standardization): 43
After grouping and merging, there are 43 event types left (remember,
full details on how the decisions were made are available on Appendix
B), and there are six less rows after the process. The missing six rows
are rows with EVTYPE containing the words “SUMMARY”,
“NORTHERN LIGHTS”, and “NONE”, which denote a summary of events already
recorded in other event types or events that are not actually severe
weather events.
Our data processing is almost done, and the last thing we need to do
is discarding several columns that have been used for processing and no
longer needed to produce the main analysis later (BGN_DATE,
YEAR, REMARKS, REFNUM).
clean_data <- std_data %>% select(-(YEAR:BGN_DATE), -(REMARKS:REFNUM))
head(clean_data) %>% kable
| EVTYPE | FATALITIES | INJURIES | PROPDMG | CROPDMG | POPHEALTHDMG | ECONDMG |
|---|---|---|---|---|---|---|
| LIGHTNING | 0 | 0 | 50000 | 0 | 0 | 50000 |
| THUNDERSTORM WIND | 0 | 0 | 1500 | 0 | 0 | 1500 |
| FLOOD/HIGH WATER | 0 | 0 | 8000 | 0 | 0 | 8000 |
| THUNDERSTORM WIND | 0 | 0 | 15000 | 2000 | 0 | 17000 |
| THUNDERSTORM WIND | 0 | 0 | 20000 | 2000 | 0 | 22000 |
| TORNADO | 0 | 0 | 80000 | 2000 | 0 | 82000 |
As a reminder, below are the two questions we need to answer in this report:
Across the United States, which types of events (as indicated in
the EVTYPE variable) are most harmful with respect to
population health?
Across the United States, which types of events have the greatest economic consequences?
In order to answer these questions, two visualizations will be produced showing the top 10 event types that contribute the damages required in the objectives. The visualizations are deemed to be sufficient to answer the objectives because the top 10 contributes ~90% and ~96% of the total population health and economic damages, respectively. By putting emphasis on the events with the most impact, the results will also help answer the other objective this report is meant to help with: prioritizing resources that will be used to prepare for severe weather events.
The analysis and visualizations will be provided in the next sections. Note that no specific recommendation is required for the report.
In terms of population health, there are two possible variables in
NOAA Storm Events Database that are relevant to the topic: fatalities
and injuries. To have a single metric that captures both, fatalities and
injuries are added together into a single variable named
POPHEALTHDMG. This is the metric displayed in the
visualization.
pophealthdmg <- clean_data %>%
group_by(EVTYPE) %>% summarize(SUM_DMG = sum(POPHEALTHDMG),
AVG_DMG = mean(POPHEALTHDMG) %>% round(2),
LABEL = str_glue(label_comma()(SUM_DMG),
" victims (", AVG_DMG, " avg)")
) %>%
arrange(-SUM_DMG) %>% slice_head(n = 10)
pophealth_plot <- ggplot(pophealthdmg, aes(reorder(EVTYPE, SUM_DMG), SUM_DMG, fill = SUM_DMG)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = if_else(SUM_DMG > 10000, LABEL, NA)), color = "white", hjust = 1.1, size = 3, na.rm = TRUE) +
geom_label(aes(label = if_else(SUM_DMG <= 10000, LABEL, NA)), size = 3, fill = "#E2E3EF", colour = "black", label.size = NA, hjust = -0.1, na.rm = TRUE) +
labs(title = "Most Harmful Weather Events to Population Health",
subtitle = "Top 10 Weather Types, Damages Between 1996-2011",
x = NULL, y = "Total Fatalities & Injuries",
caption = "Data from NOAA Storm Database (Dec '96–Nov '11). Avg is calculated per event.") +
coord_flip() +
scale_y_continuous(labels = label_comma()) +
scale_fill_gradient(high = "#002142", low = "#005FA3")
pophealth_plot
As mentioned before, the top 10 events contribute around ~90% of all damages caused by weather events between December 1996 to November 2011. From the figure, we can see that Tornado is the event type with the most fatalities and injuries by a large margin, while the event type with the highest average victim per event is Hurricane/Typhoon. As a result, although it is true that it would be more useful to prepare for events with large damage total like Tornado, local governments should be especially cautious when events like Hurricane/Typhoon with high average happen, as it may claim more victims in one incident once it happens.
Provided below is the complete tally for the top 10 events. Note how there are overwhelmingly more injuries than fatalities in all events, except for Rip Current. It is very difficult to escape a rip current once it has caught a person, which may explain the fatality number.
clean_data %>%
group_by(EVTYPE) %>% summarize(n = n(),
Injuries = sum(INJURIES),
Fatalities = sum(FATALITIES),
Total = sum(POPHEALTHDMG),
Average = mean(POPHEALTHDMG) %>% round(2)) %>%
arrange(-Total) %>% slice_head(n = 10) %>% mutate(`#` = row_number(), .before = EVTYPE) %>%
kable
| # | EVTYPE | n | Injuries | Fatalities | Total | Average |
|---|---|---|---|---|---|---|
| 1 | TORNADO | 21933 | 19963 | 1486 | 21449 | 0.98 |
| 2 | FLOOD/HIGH WATER | 74507 | 8418 | 1206 | 9624 | 0.13 |
| 3 | EXCESSIVE/UNUSUAL HEAT | 2696 | 7573 | 2001 | 9574 | 3.55 |
| 4 | THUNDERSTORM WIND | 201408 | 4821 | 359 | 5180 | 0.03 |
| 5 | LIGHTNING | 12300 | 3828 | 599 | 4427 | 0.36 |
| 6 | WIND GUST | 23163 | 1397 | 340 | 1737 | 0.07 |
| 7 | ICE/WINTER STORM | 12658 | 1299 | 243 | 1542 | 0.12 |
| 8 | WILDFIRE | 4118 | 1425 | 87 | 1512 | 0.37 |
| 9 | HURRICANE/TYPHOON | 228 | 1309 | 89 | 1398 | 6.13 |
| 10 | RIP CURRENT | 703 | 438 | 517 | 955 | 1.36 |
In terms of economic consequences, there are two possible variables
in NOAA Storm Events Database that are relevant to the topic: property
and crop damages. To have a single metric that captures both, property
and crop damages are added together into a single variable named
ECONDMG. This is the metric displayed in the
visualization.
label_dollar_abbrev <- label_currency(scale_cut = cut_short_scale())
econdmg <- clean_data %>%
group_by(EVTYPE) %>% summarize(SUM_DMG = sum(ECONDMG),
AVG_DMG = mean(ECONDMG) %>% round(2),
LABEL = str_glue(label_dollar_abbrev(SUM_DMG),
" damages (", label_dollar_abbrev(AVG_DMG),
" avg)")
) %>%
arrange(-SUM_DMG) %>% slice_head(n = 10)
econ_plot <- ggplot(econdmg, aes(reorder(EVTYPE, SUM_DMG), SUM_DMG, fill = SUM_DMG)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = if_else(SUM_DMG > 10 ^ 11, LABEL, NA)), color = "white", hjust = 1.1, size = 3, na.rm = TRUE) +
geom_label(aes(label = if_else(SUM_DMG <= 10 ^ 11, LABEL, NA)), size = 3, fill = "#E2E3EF", colour = "black", label.size = NA, hjust = -0.1, na.rm = TRUE) +
labs(title = "Weather Events with the Most Economic Consequences",
subtitle = "Top 10 Weather Types, Damages Between 1996-2011",
x = NULL, y = "Total Economic Damages (Property + Crop)",
caption = "Data from NOAA Storm Database (Dec '96–Nov '11). Avg is calculated per event.") +
coord_flip() +
scale_y_continuous(labels = label_dollar_abbrev) +
scale_fill_gradient(high = "#002142", low = "#005FA3")
econ_plot
As mentioned before, the top 10 events contribute around ~96% of all economic damages caused by weather events between December 1996 to November 2011, which is a larger proportion than the case with population health. Flood/High Water is the event type contributing the most damage at $163B, followed by Hurricane/Typhoon at $85B, and it is worth noting that Hurricane/Typhoon has the highest average damage at $374M too, followed by Storm Surge/Tide at $126M damages.
The complete tally will be given below, and we can see from the tally that Flood/High Water is a relatively “common” event compared to Hurricane/Typhoon (or Storm Surge/Tide), which likely helps explain the disparity in average damage that we see here. After all, Flood/High Water has many possible causes and includes many kinds of events, from large-scale flash flood to urban & small stream flood, which lets it have comparatively low average compared to Hurricane/Typhoon. In fact, the tally seems to suggest that rare top 10 events that have less than 1K incidents possess exceptionally high average damage when they occur.
clean_data %>%
group_by(EVTYPE) %>% summarize(n = n(),
Property = sum(PROPDMG),
Crop = sum(CROPDMG),
Total = sum(ECONDMG),
Average = mean(ECONDMG) %>% round(2)) %>%
arrange(-Total) %>% slice_head(n = 10) %>% ungroup %>%
mutate(`#` = row_number(), .before = EVTYPE) %>%
mutate(across(Property:Total, label_dollar_abbrev), Average = label_currency()(Average)) %>%
kable
| # | EVTYPE | n | Property | Crop | Total | Average |
|---|---|---|---|---|---|---|
| 1 | FLOOD/HIGH WATER | 74507 | $157.249B | $5.93B | $163.18B | $2,190,158 |
| 2 | HURRICANE/TYPHOON | 228 | $80.337B | $5.00B | $85.34B | $374,284,920 |
| 3 | STORM SURGE/TIDE | 381 | $47.827B | $850K | $47.83B | $125,532,774 |
| 4 | TORNADO | 21933 | $23.898B | $271M | $24.17B | $1,101,959 |
| 5 | HAIL | 197111 | $13.982B | $2.31B | $16.29B | $82,669 |
| 6 | DRYNESS & DROUGHT | 2502 | $919M | $12.88B | $13.80B | $5,514,463 |
| 7 | THUNDERSTORM WIND | 201408 | $7.508B | $976M | $8.48B | $42,124 |
| 8 | TROPICAL STORM | 658 | $7.596B | $677M | $8.27B | $12,572,776 |
| 9 | WILDFIRE | 4118 | $7.687B | $399M | $8.09B | $1,963,509 |
| 10 | WIND GUST | 23163 | $5.317B | $592M | $5.91B | $255,120 |
Visualizations have been created to answer the two questions presented at the beginning of the report.
pophealth_plot1 <- pophealth_plot + labs(title = "Population Health Harm", subtitle = NULL, caption = NULL)
econ_plot1 <- econ_plot + labs(title = "Economic Damages", subtitle = NULL, caption = NULL)
pophealth_plot1 + econ_plot1 +
plot_annotation(title = "Top 10 Severe Weather Events by Damage Type",
subtitle = "December 1996–November 2011",
caption = "Data from NOAA Storm Database.")
Based on the analysis, to reduce or mitigate damages caused by severe weather events, the government can prioritize their resources for the top 10 events displayed by the respective population health and economic damage visualizations. Because the top 10 events alone accounted for more than 90% of total damages, they are a good pick for priority given limited resources. The combined visualizations can be seen above.
-EXP Column
ValuesAccording to the given documentation (pg. 12), damage magnitude should be denoted with “K”, “M”, or “B”. Other than that, we might expect empty strings or other similar ways to denote that a magnitude indicator is not needed, when the damage value is too small to be given a magnitude. However, there are other mysterious characters present in data between 1993-1995, displayed below:
exp_chars <- data %>% filter(between(BGN_DATE %>% mdy_hms %>% year, 1993, 1995)) %$% c(PROPDMGEXP, CROPDMGEXP)
exp_chars[!str_to_upper(exp_chars) %in% c("B", "K", "M", "")] %>% table
## .
## - ? + 0 1 2 3 4 5 6 7 8 h H
## 1 15 5 234 25 14 4 4 28 4 5 1 1 6
From all the characters we can see, “-” and “h”/“H” might be possible
to explain as indicator for empty values or hundreds, respectively, and
0 could be something similar with “-”, but the rest is much more
difficult to puzzle out. Suppose there is a PROPDMG of 12,
if the PROPDMGEXP value is 2, does it mean the property
damage is 12^2? Or 12 * 10^2? Or perhaps it is something else
entirely?
There is something interesting in the 1993 annual Storm Data Publication (can be obtained here) that could hint what an integer with a single digit might mean for damage magnitude. A picture is not attached here (in case there will be someone who sees it and consider it a new figure exceeding the 3-figure limit), but more or less it reads as follows:
REFERENCE NOTES
STORM DATA CATAGORIES (sic)
1 Less than $50
2 $50 to $500
…
8 $50 Million to $500 Million
9 $500 Million to $5 Billion
Is that what the magnitude numbers mean? On the other hand, the
dataset version that has to be used in this report is an old version,
and the newest version currently hosted at NOAA’s website no longer has
the -EXP columns, just the damage numbers directly that are
easier to interpret. It is not easy to look up the current figures for
each event with these magnitude numbers in the database, but one of the
results seems to indicate that these numbers means 0; that is, instead
of 12 * 10^2, the damage should be 12 * 10^0.
In conclusion, it is uncertain and ambiguous what these numbers in
-EXP columns actually mean. It could be something large
like $500 millions in the 1993 annual Storm Data Publication, or it
could be as small as 0. The best way to deal with this would be
downloading the newest version of the database which has the more
accurate numbers, but we are currently stuck using this old version for
now. Most of our data cleaning problems here probably do not exist
anymore in the newest version of the dataset, yet it might feel a bit
unfair if we skip all the problems while other people still have to deal
with using the old version…
EVTYPEFrom 323 unique event types, the event types are merged together
until only 43 event types are left. While events are combined in a
manner as close as possible to NOAA’s 48 event types, there are several
event types in EVTYPE that are impossible to classify
cleanly into one of NOAA’s event types due to ambiguity. This ambiguity
may occur because one EVTYPE actually includes two or more
standard event types at the same time, or it can also occur because
remarks that may help classification also do not exist within that event
type. Both these cases, where the classifications are done differently
due to circumstances in the dataset, are documented below:
| No | EVTYPE | Description |
|---|---|---|
| 1 | Cold/Wind Chill | In NOAA’s event types, Cold/Wind Chill is actually distinguished into Cold/Wind Chill and Extreme Cold/Wind Chill. However, there are many non-standard event types (ex: “Cold”, “Cold Temperatures”) that do not identify whether it is just the normal Cold/Wind Chill or it is the Extreme type, so the two variants are merged in order to accomodate those non-standard event types. |
| 2 | Excessive/Unusual Heat | The case is similar with Cold/Wind Chill, while the definition is also broadened with Unusual Heat since the dataset also contains those particular cases. Unusual heat may refer to situations where the temperature by itself is not very high, but it happens at an unusual time (ex: outside the expected months). |
| 3 | Excessive/Unusual Rain | Originally, it is named as Heavy Rain in NOAA’s classification, but the definition (or at least the wording) is broadened to include unusual rain for similar reasons with Excessive/Unusual Heat. |
| 4 | Dryness & Drought | Originally, it is named as Drought in NOAA’s classification. However, there are some non-standard event types that do not record outright drought, but simply dry conditions, so the definition is broadened here. |
| 5 | Flood/High Water | Flood/High Water is a combination of Flood, Flash Flood, and a non-standard category called High Water. Flood and Flash Flood are combined due to similar reasons with Cold/Wind Chill, while High Water is a separate category where an outright flood does not occur, but the water (usually within a creek) is high enough to result in casualties. |
| 6 | Coastal Flood & Erosion | Originally, it is only named as Coastal Flood in NOAA’s classification. However, the definition is broadened because there are non-standard event types dealing with beach erosion (ex: “Coastal Flooding/Erosion), and both may happen at the same time. |
| 7 | Rough/Heavy Wave | The closest existing type to this would be High Surf, but High Surf is defined as “large waves on or near shore”, while this type is created to accommodate non-standard event types where the waves may not necessarily happen close to shores. |
| 8 | Frost/Freeze/Glaze | Originally, it is only named as Frost/Freeze in NOAA’s classification. The definition is slightly broadened by adding Glaze to make it more descriptive after integrating similar non-standard event types (ex: “Black Ice”). |
| 9 | Smoke | Renamed from Dense Smoke to just Smoke, because the non-standard event type that deals with smoke, named Smoke, may not always specify the density of the smokes. |
| 10 | Fog | The case is similar with Cold/Wind Chill, except the original types are Dense Fog and Freezing Fog. |
| 11 | Winter Weather | There is an existing event type named Winter Weather, but the content is broadened in this analysis to, at the very least, include Sleet, which used to be separate. Before 2004, there used to be a non-standard event type called “Mixed Precip(itation)”, which has some overlap with Winter Weather but includes Sleet in it, leading to Winter Weather being broadened to include Sleet for the purpose of accommodating Mixed Precipitation. The original definition of Winter Weather includes “snow, … , or freezing rain/drizzle”, and sleet seems to be in somewhat a spectrum between them here or within Mixed Precipitation remarks. |
| 12 | Snow | Originally, it is named as Heavy Snow in NOAA’s classification. The case is similar with Dryness & Drought or Smoke (having many non-standard event types not specifying the intensity of the situation), except snow-related non-standard event types are even more diverse than those two cases. |
| 13 | Ice/Winter Storm | Originally, there are Ice Storm and Winter Storm event types. However, the non-standard event types “Ice” and “Ice/Snow” are actually filled with many examples of a full-fledged ice/winter storm, so the event types are combined to accommodate those cases. Winter precipitation cases with at least one type exceeding warning criteria may be classified into this event type. |
| 14 | Wind Gust, Marine Wind Gust | Originally, there are at least two variants for Wind event types: Strong or High Wind. However, there are many non-standard event types that do not distinguish between both (ex: “Wind”), and they are numerous enough that the two event types have to be merged to accommodate them. |
The ones described above are the major modifications from NOAA’s 48 event types. Other than that, regular expressions are used to translate names of non-standard event types to standard event types. Regular expressions may be necessary due to typos, name variations, a non-standard event type being a subset of a standard event type (ex: dry/wet microburst to thunderstorm wind) or complex patterns are required to avoid mistakes in classifying. These patterns are formulated by carefully researching the event names, reading the remarks associated with the events with their definitions in the documentation, and cross-checking with the newer version of the dataset on NOAA’s website when necessary (ex: “Saharan Dust”).