United States Storm Data Analysis 1950 - 2011: tornado caused most injuries and flood caused most damages

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

There are two questions we need to carry in the analysis:
- Which types of events are most harmful to population health?
- Which types of events have the greatest economic consequences?
After exploring for the data over years, we noted tornado and flood are the answers, the steps below demostrate how the analysis is reproduced.

1. DATA PROCESSING

1.1 Load library and settings

Preload of packages we need. mainly from Tidyverse and Plotly ggplot2

knitr::opts_chunk$set(echo = TRUE)
library(R.utils)
library(readr)
library(tidyr)
library(lubridate)
library(stringr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(treemap)
library(d3treeR)
library(ggmap)
library(plotly)
library(GGally)

1.2 Data ingestion

The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. Aside of this we added event tibble for better mapping of all the events appears in the data, and states abbreviations for mapping of the geolocations.

Storm data, direct download and unzipped from bz2 to csv
Event table contains of 48 event types, pulled from the National Weather Service Documentation
States from United States official abbreviations

if (!file.exists("storm.csv")) {
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'storm.csv.bz2', method='curl' )
bunzip2(filename="storm.csv.bz2", "storm.csv")
}
storm <- read_csv("storm.csv", col_name = TRUE)
evnt_tbl <- read_csv("storm-data-event-table.csv", col_name = TRUE)
state_tbl <- read_csv("state.csv", col_name = TRUE)

1.3 Prepare event types

When profiling data, we immediately found that EVTYPE is the descriptive keyword we need to base, but there are 977 distinct values. Most of them are dirty and need for clean up due to several reasons:

Aggregated result are found. e.g. “SUMMARY OF MARCH 24-25”, “Summary of May 9-10”
Some measurement numbers like speed and probably height is recorded following after the event name e.g. “THUNDERSTORM WIND 65MPH”, “HIGH WIND 70”
Incorrect spelling e.g. “THUNDEERSTORM WINDS”, “THUNERSTORM WINDS”
More than one event may be listed together e.g. “DROUGHT/EXCESSIVE HEAT”, “HIGH WINDS/FLOODING”
Some word is too common and can be appear in several events e.g. “wind” in “High Wind”, “Strong Wind”, “Marine High Wind”, “Thunderstorm Wind”

To solve this, we created a new event column so:

The unwanted row and words are clear away
Use precised keyword to correct event name e.g “^thun.*” is in used for correction of THUNDEERSTORM WINDS / THUNERSTORM WINDS
Correct most event with starting keywords, as if two event appears we consider the one appear first
Take extra care for ending keywords, it is meant not to overlapped the result being corrected ready by starting keywords
Assumptions are made here according to the 48 type of events listed in the documentation e.g Landslide / Landslump / Mudslide are addressed as “Debris Flow”, “Light Snow” is addressed as “Winter Weather”

storm_event <- storm %>%
# Summary should be filter away
filter(!str_detect(EVTYPE, "SUMMARY.*") & !str_detect(EVTYPE, "Summary.*")) %>%
# Build new column name it Event for data cleansing while original column EVTYPE is kept for reference
mutate(Event = EVTYPE) %>%
# Remove measures description to align the event name
mutate(Event = str_replace_all(Event, "G[0-9].*", "")) %>%
mutate(Event = str_replace_all(Event, "F[0-9].*", "")) %>%
mutate(Event = str_replace_all(Event, "[0-9].*", "")) %>%
mutate(Event = str_replace_all(Event, "\\(.*", "")) %>%
mutate(Event = str_to_lower(Event)) %>%
# Collect events starting with specific words
mutate(Event = str_replace_all(Event, regex("^avalan.*"), "Avalanche")) %>%
mutate(Event = str_replace_all(Event, regex("^blizzard.*"), "Blizzard")) %>%
mutate(Event = str_replace_all(Event, regex("^coastal.*"), "Coastal Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^cold.*"), "Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^record.*?cold.*"), "Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^land.*"), "Debris Flow")) %>%
mutate(Event = str_replace_all(Event, regex("^mudslide$"), "Debris Flow")) %>%
mutate(Event = str_replace_all(Event, regex("^drought.*"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex("^dry.*"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex("^dust d.*"), "Dust Devil")) %>%
mutate(Event = str_replace_all(Event, regex("^extreme.*?cold.*"), "Extreme Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^ext.*?cold/windchill.*"), "Extreme Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^extreme wind.*"), "Extreme Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^flash flood.*"), "Flash Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^flood.*"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^river flood$"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^urban/small stream flood$"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^freez.*"), "Frost/Freeze")) %>%
mutate(Event = str_replace_all(Event, regex("^frost.*"), "Frost/Freeze")) %>%
mutate(Event = str_replace_all(Event, regex("^fog.*"), "Freezing Fog")) %>%
mutate(Event = str_replace_all(Event, regex("^funnel.*"), "Funnel Cloud")) %>%
mutate(Event = str_replace_all(Event, regex("^hail.*"), "Hail")) %>%
mutate(Event = str_replace_all(Event, regex("^small hail$"), "Hail")) %>%
mutate(Event = str_replace_all(Event, regex("^extreme heat.*"), "Heat")) %>%
mutate(Event = str_replace_all(Event, regex("^record.*?heat.*"), "Heat")) %>%
mutate(Event = str_replace_all(Event, regex("^heat.*"), "Heat")) %>%
mutate(Event = str_replace_all(Event, regex("^heavy rain.*"), "Heavy Rain")) %>%
mutate(Event = str_replace_all(Event, regex("^heavy.*?snow.*"), "Heavy Snow")) %>%
mutate(Event = str_replace_all(Event, regex("^excessive snow$"), "Heavy Snow")) %>%
mutate(Event = str_replace_all(Event, regex("^high surf.*"), "High Surf")) %>%
mutate(Event = str_replace_all(Event, regex("^high wind.*"), "High Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^wind.*"), "High Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^hurricane.*"), "Hurricane (Typhoon)")) %>%
mutate(Event = str_replace_all(Event, regex("^typhoon.*"), "Hurricane (Typhoon)")) %>%
mutate(Event = str_replace_all(Event, regex("^ice.*"), "Ice Storm")) %>%
mutate(Event = str_replace_all(Event, regex("^lake.*?effect snow$"), "Lake-Effect Snow")) %>%
mutate(Event = str_replace_all(Event, regex("^lightning.*"), "Lightning")) %>%
mutate(Event = str_replace_all(Event, regex("^marine tstm.*"), "Marine Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^rip current.*"), "Rip Current")) %>%
mutate(Event = str_replace_all(Event, regex("^sleet.*"), "Sleet")) %>%
mutate(Event = str_replace_all(Event, regex("^storm surge.*"), "Storm Surge/Tide")) %>%
mutate(Event = str_replace_all(Event, regex("^strong high wind$"), "Strong Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^strong wind.*"), "Strong Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^thun.*"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^tstm.*"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^severe thunderstorms$"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^torn.*"), "Tornado")) %>%
mutate(Event = str_replace_all(Event, regex("^tropical storm.*"), "Tropical Storm")) %>%
mutate(Event = str_replace_all(Event, regex("^volcanic.*"), "Volcanic Ash")) %>%
mutate(Event = str_replace_all(Event, regex("^water.*"), "Waterspout")) %>%
mutate(Event = str_replace_all(Event, regex("^wild.*?fire.*"), "Wildfire")) %>%
mutate(Event = str_replace_all(Event, regex("^winter s.*"), "Winter Storm")) %>%
mutate(Event = str_replace_all(Event, regex("^winter w.*"), "Winter Weather")) %>%
mutate(Event = str_replace_all(Event, regex("^snow.*"), "Winter Weather")) %>%
mutate(Event = str_replace_all(Event, regex("^light snow$"), "Winter Weather")) %>%
mutate(Event = str_replace_all(Event, regex("^wint.*?mix"), "Winter Weather")) %>%
# Collect events ending with specific words
mutate(Event = str_replace_all(Event, regex(".*drought$"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex(".*dry$"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex(".*flooding.*"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex(".*fld.*"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex(".*high surf$"), "High Surf")) %>%
mutate(Event = str_replace_all(Event, regex(".*thunderstorm$"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex(".*winter storm$"), "Winter Storm")) %>%
mutate(Event = str_replace_all(Event, regex(".*winter weather$"), "Winter Weather")) %>%
mutate(Event = str_to_title(Event)) %>%
as.data.frame()

For a preview of prepared data, we now see
- 977 EVTYPE is reduced to 317 Events with major event categorized ready
- all 48 event categorized and shows up

storm_event %>% count(Event, sort = TRUE) %>% tibble()

## # A tibble: 317 x 2
##    Event                         n
##    <chr>                     <int>
##  1 Thunderstorm Wind        324815
##  2 Hail                     288829
##  3 Tornado                   60687
##  4 Flash Flood               55038
##  5 Flood                     29876
##  6 High Wind                 22252
##  7 Heavy Snow                15843
##  8 Lightning                 15766
##  9 Marine Thunderstorm Wind  11987
## 10 Heavy Rain                11801
## # ... with 307 more rows

df <- merge(evnt_tbl, storm_event)
df %>% count(Event, sort = TRUE)

##                       Event      n
## 1         Thunderstorm Wind 324815
## 2                      Hail 288829
## 3                   Tornado  60687
## 4               Flash Flood  55038
## 5                     Flood  29876
## 6                 High Wind  22252
## 7                Heavy Snow  15843
## 8                 Lightning  15766
## 9  Marine Thunderstorm Wind  11987
## 10               Heavy Rain  11801
## 11             Winter Storm  11439
## 12           Winter Weather   9233
## 13             Funnel Cloud   6980
## 14                 Wildfire   4231
## 15               Waterspout   3860
## 16              Strong Wind   3775
## 17                  Drought   2802
## 18                 Blizzard   2728
## 19                Ice Storm   2094
## 20  Extreme Cold/Wind Chill   1894
## 21             Frost/Freeze   1831
## 22           Excessive Heat   1678
## 23                Dense Fog   1293
## 24                High Surf    968
## 25                     Heat    956
## 26            Coastal Flood    863
## 27              Rip Current    777
## 28          Cold/Wind Chill    729
## 29           Tropical Storm    697
## 30         Lake-Effect Snow    659
## 31              Debris Flow    630
## 32             Freezing Fog    539
## 33              Marine Hail    442
## 34               Dust Storm    427
## 35         Storm Surge/Tide    409
## 36                Avalanche    387
## 37      Hurricane (Typhoon)    299
## 38    Astronomical Low Tide    174
## 39               Dust Devil    151
## 40         Marine High Wind    135
## 41                    Sleet     78
## 42      Tropical Depression     60
## 43       Marine Strong Wind     48
## 44             Volcanic Ash     29
## 45          Lakeshore Flood     23
## 46                   Seiche     21
## 47                  Tsunami     20
## 48              Dense Smoke     10

1.4 Prepare property and crop damage amounts

While looking into economic consequences, we noted PROPDMG * PROPDMGEXP and CROPDMG * CROPDMGEXP are two pairs.
The EXP seems have stored the amount explanations like hundred, thousand, million and billion.
Therefore next transformation is just for monetary conversion.
For number 0-8, unfortunately there isn’t much explanation we could find in the document, therefore they are treated as 1 to our result. i.e. no effects

unique(df$PROPDMGEXP)

##  [1] "K" NA  "M" "?" "0" "7" "4" "5" "1" "B" "+" "2" "8" "H" "m" "-" "3" "6" "h"

unique(df$CROPDMGEXP)

## [1] "K" NA  "M" "B" "0" "k" "?" "m" "2"

df <- df %>%
mutate(BgnDate = as_date(BGN_DATE, format = "%m/%d/%Y")) %>%
mutate(EndDate = as_date(END_DATE, format = "%m/%d/%Y")) %>%
mutate(BgnYear = year(BgnDate)) %>%
mutate(EndYear = year(EndDate)) %>%
mutate(PropDmgAmt =
    case_when(PROPDMGEXP %in% c("H", "h") ~ PROPDMG * 10^2,
            PROPDMGEXP %in% c("K", "k") ~ PROPDMG * 10^3,
            PROPDMGEXP %in% c("M", "m") ~ PROPDMG * 10^6,
            PROPDMGEXP %in% c("B", "b") ~ PROPDMG * 10^9,
            TRUE ~ PROPDMG)
            ) %>%
mutate(CropDmgAmt =
    case_when(CROPDMGEXP %in% c("H", "h") ~ CROPDMG * 10^2,
            CROPDMGEXP %in% c("K", "k") ~ CROPDMG * 10^3,
            CROPDMGEXP %in% c("M", "m") ~ CROPDMG * 10^6,
            CROPDMGEXP %in% c("B", "b") ~ CROPDMG * 10^9,
            TRUE ~ CROPDMG)
            ) %>%
mutate(TotalDmgAmt = PropDmgAmt + CropDmgAmt)

2. RESULT

2.1 Across the United States, which types of events are most harmful to population health?

To have quick result first we select top 10 events out of 48, sorting by the injuries sum.
We see tornado has way more harms to population than the following rest 9 events.

df %>% count(STATE, sort = TRUE) %>% tibble()

## # A tibble: 72 x 2
##    STATE     n
##    <chr> <int>
##  1 TX    83689
##  2 KS    53419
##  3 OK    46726
##  4 MO    35632
##  5 IA    30967
##  6 NE    30249
##  7 IL    28458
##  8 AR    27095
##  9 NC    25307
## 10 GA    25234
## # ... with 62 more rows

p1 <- df %>%
    group_by(Event) %>%
    summarise(
            Fatalities.Sum = sum(FATALITIES, na.rm = TRUE),
            Injuries.Sum = sum(INJURIES, na.rm = TRUE)
    ) %>%
    top_n(10) %>%
    melt(id = "Event")
    
ggplot(data = p1, aes(x = reorder(Event, value), y = value, fill = variable)) +
geom_bar(position = "dodge", stat = "identity") + coord_flip() +
theme_minimal()

To explore in more detail, a tree map is plotted with 3 levels:
1. Designator C = County | Z - Zone | M - Marine 2. All 48 events 3. Number of injuries | fatalities Thus in each level, area represents their harmed population accordingly.

p2s <- df %>%
    group_by(Designator, Event) %>%
    summarise(
            Fatalities.Sum = sum(FATALITIES, na.rm = TRUE),
            Injuries.Sum = sum(INJURIES, na.rm = TRUE)
    ) %>%
    top_n(10) %>%
    melt(id = c("Designator", "Event"))

p2 <- treemap(p2s,
            index = c("Designator", "Event", "variable"),
            vSize = "value",
            type = "index",
            palette = "Set2",
            bg.labels = c("white"),
            align.labels = list(
              c("center", "center"),
              c("right", "bottom")
            )
          )

d3tree2(p2, rootname = "United States 1950 - 2011 number of injuries and fatalities by event (C - County | Z - Zone | M - Marine)")

2.2 Across the United States, which types of events have the greatest economic consequences?

For quick result top 10 events out of 48 shows, we can see flood and hurricane are the top 2 in damage to property wherease drought is the top 1 in damage to crop.

p3 <- df %>%
    group_by(Event) %>%
    summarise(
            PropDmgAmt.Sum = sum(PropDmgAmt, na.rm = TRUE),
            CropDmgAmt.Sum = sum(CropDmgAmt, na.rm = TRUE)
    ) %>%
    top_n(10) %>%
    melt(id = "Event")
    
ggplot(data = p3, aes(x = reorder(Event, value), y = value, fill = variable)) +
geom_bar(position = "dodge", stat = "identity") + coord_flip() +
theme_minimal()

Same again, a tree map is plotted with 3 levels:
1. Designator C = County | Z - Zone | M - Marine
2. All 48 events 3. Damage to property | crops Thus in each level, area represents their damaged monetary amount.

p4s <- df %>%
    group_by(Designator, Event) %>%
    summarise(
            PropDmgAmt.Sum = sum(PropDmgAmt, na.rm = TRUE),
            CropDmgAmt.Sum = sum(CropDmgAmt, na.rm = TRUE)
    ) %>%
    top_n(10) %>%
    melt(id = c("Designator", "Event"))

p4 <- treemap(p4s,
            index = c("Designator", "Event", "variable"),
            vSize = "value",
            type = "index",
            palette = "Set2",
            bg.labels = c("white"),
            align.labels = list(
              c("center", "center"),
              c("right", "bottom")
            )
          )

d3tree2(p4, rootname = "United States 1950 - 2011 property and crop damage by event (C - County | Z - Zone | M - Marine)")

2.3 Plot total damage to map across the United States

We noted the fact, events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

So in order to overview total damage occur over years to crossing the United States, 80 top result are selected from all the states in each years. Then bubble plot representing the damage amount can be made to US map, run against the year timestamp.

lonlat <- mutate_geocode(state_tbl, StateName)

p5s <- df %>%
    group_by(BgnYear, STATE) %>%
    slice(which.max(TotalDmgAmt)) %>%
    ungroup() %>%
    top_n(80)

p5ll <- merge(p5s, lonlat)

During the time sequence is played, it is easy to spot the color in yellow (represented the highest damage) occurred in 2006, where over 115 billion damage is recorded in California by a biggest ever flood, below is its remark for further understandings.

p5 <- ggmap(get_map(location = "USA", zoom = 4), darken = .3, 
base_layer = ggplot(data = p5ll, aes(x = lon, y = lat, frame = BgnYear, ids = STATE))) +
geom_point(data = p5ll, aes(color = TotalDmgAmt, size = TotalDmgAmt, alpha = .7)) +
scale_size(range = c(4, 14)) +
scale_color_viridis_c()

ggplotly(p5)

p5ll %>% slice_max(TotalDmgAmt)

##   STATE Event Designator STATE__         BGN_DATE    BGN_TIME TIME_ZONE COUNTY
## 1    CA Flood          C       6 1/1/2006 0:00:00 12:00:00 AM       PST     55
##   COUNTYNAME EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI         END_DATE    END_TIME
## 1       NAPA  FLOOD         0    <NA> COUNTYWIDE 1/1/2006 0:00:00 07:00:00 AM
##   COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH  F MAG
## 1          0         NA         0    <NA> COUNTYWIDE      0     0 NA   0
##   FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO
## 1          0        0     115          B    32.5          M MTR
##            STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 1 CALIFORNIA, Western      <NA>     3828     12218       3828      12218
##                                                                                                                                                                                                                                                                                                                                                                                          REMARKS
## 1 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
##   REFNUM    BgnDate    EndDate BgnYear EndYear PropDmgAmt CropDmgAmt
## 1 605943 2006-01-01 2006-01-01    2006    2006   1.15e+11   32500000
##    TotalDmgAmt  StateName Abbreviation       lon      lat
## 1 115032500000 California       Calif. -119.4179 36.77826

3. SUMMARY

In this project, we accomplished answering the two initial questions. Which types of events are most harmful to population health? - Tornado Which types of events have the greatest economic consequences? - Flood

Not only that we also made a deep dive into time across geolocation, and found the biggest factor to our result typically lying under specific locaiton by specific time.

Futher reading of the story, here is Northern California Floods from NASA