Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
There are two questions we need to carry in the analysis:
- Which types of events are most harmful to population health?
- Which types of events have the greatest economic consequences?
After exploring for the data over years, we noted tornado and flood are the answers, the steps below demostrate how the analysis is reproduced.
Preload of packages we need. mainly from Tidyverse and Plotly ggplot2
knitr::opts_chunk$set(echo = TRUE)
library(R.utils)
library(readr)
library(tidyr)
library(lubridate)
library(stringr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(treemap)
library(d3treeR)
library(ggmap)
library(plotly)
library(GGally)
The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. Aside of this we added event tibble for better mapping of all the events appears in the data, and states abbreviations for mapping of the geolocations.
if (!file.exists("storm.csv")) {
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'storm.csv.bz2', method='curl' )
bunzip2(filename="storm.csv.bz2", "storm.csv")
}
storm <- read_csv("storm.csv", col_name = TRUE)
evnt_tbl <- read_csv("storm-data-event-table.csv", col_name = TRUE)
state_tbl <- read_csv("state.csv", col_name = TRUE)
When profiling data, we immediately found that EVTYPE is the descriptive keyword we need to base, but there are 977 distinct values. Most of them are dirty and need for clean up due to several reasons:
To solve this, we created a new event column so:
storm_event <- storm %>%
# Summary should be filter away
filter(!str_detect(EVTYPE, "SUMMARY.*") & !str_detect(EVTYPE, "Summary.*")) %>%
# Build new column name it Event for data cleansing while original column EVTYPE is kept for reference
mutate(Event = EVTYPE) %>%
# Remove measures description to align the event name
mutate(Event = str_replace_all(Event, "G[0-9].*", "")) %>%
mutate(Event = str_replace_all(Event, "F[0-9].*", "")) %>%
mutate(Event = str_replace_all(Event, "[0-9].*", "")) %>%
mutate(Event = str_replace_all(Event, "\\(.*", "")) %>%
mutate(Event = str_to_lower(Event)) %>%
# Collect events starting with specific words
mutate(Event = str_replace_all(Event, regex("^avalan.*"), "Avalanche")) %>%
mutate(Event = str_replace_all(Event, regex("^blizzard.*"), "Blizzard")) %>%
mutate(Event = str_replace_all(Event, regex("^coastal.*"), "Coastal Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^cold.*"), "Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^record.*?cold.*"), "Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^land.*"), "Debris Flow")) %>%
mutate(Event = str_replace_all(Event, regex("^mudslide$"), "Debris Flow")) %>%
mutate(Event = str_replace_all(Event, regex("^drought.*"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex("^dry.*"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex("^dust d.*"), "Dust Devil")) %>%
mutate(Event = str_replace_all(Event, regex("^extreme.*?cold.*"), "Extreme Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^ext.*?cold/windchill.*"), "Extreme Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^extreme wind.*"), "Extreme Cold/Wind Chill")) %>%
mutate(Event = str_replace_all(Event, regex("^flash flood.*"), "Flash Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^flood.*"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^river flood$"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^urban/small stream flood$"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex("^freez.*"), "Frost/Freeze")) %>%
mutate(Event = str_replace_all(Event, regex("^frost.*"), "Frost/Freeze")) %>%
mutate(Event = str_replace_all(Event, regex("^fog.*"), "Freezing Fog")) %>%
mutate(Event = str_replace_all(Event, regex("^funnel.*"), "Funnel Cloud")) %>%
mutate(Event = str_replace_all(Event, regex("^hail.*"), "Hail")) %>%
mutate(Event = str_replace_all(Event, regex("^small hail$"), "Hail")) %>%
mutate(Event = str_replace_all(Event, regex("^extreme heat.*"), "Heat")) %>%
mutate(Event = str_replace_all(Event, regex("^record.*?heat.*"), "Heat")) %>%
mutate(Event = str_replace_all(Event, regex("^heat.*"), "Heat")) %>%
mutate(Event = str_replace_all(Event, regex("^heavy rain.*"), "Heavy Rain")) %>%
mutate(Event = str_replace_all(Event, regex("^heavy.*?snow.*"), "Heavy Snow")) %>%
mutate(Event = str_replace_all(Event, regex("^excessive snow$"), "Heavy Snow")) %>%
mutate(Event = str_replace_all(Event, regex("^high surf.*"), "High Surf")) %>%
mutate(Event = str_replace_all(Event, regex("^high wind.*"), "High Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^wind.*"), "High Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^hurricane.*"), "Hurricane (Typhoon)")) %>%
mutate(Event = str_replace_all(Event, regex("^typhoon.*"), "Hurricane (Typhoon)")) %>%
mutate(Event = str_replace_all(Event, regex("^ice.*"), "Ice Storm")) %>%
mutate(Event = str_replace_all(Event, regex("^lake.*?effect snow$"), "Lake-Effect Snow")) %>%
mutate(Event = str_replace_all(Event, regex("^lightning.*"), "Lightning")) %>%
mutate(Event = str_replace_all(Event, regex("^marine tstm.*"), "Marine Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^rip current.*"), "Rip Current")) %>%
mutate(Event = str_replace_all(Event, regex("^sleet.*"), "Sleet")) %>%
mutate(Event = str_replace_all(Event, regex("^storm surge.*"), "Storm Surge/Tide")) %>%
mutate(Event = str_replace_all(Event, regex("^strong high wind$"), "Strong Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^strong wind.*"), "Strong Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^thun.*"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^tstm.*"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^severe thunderstorms$"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex("^torn.*"), "Tornado")) %>%
mutate(Event = str_replace_all(Event, regex("^tropical storm.*"), "Tropical Storm")) %>%
mutate(Event = str_replace_all(Event, regex("^volcanic.*"), "Volcanic Ash")) %>%
mutate(Event = str_replace_all(Event, regex("^water.*"), "Waterspout")) %>%
mutate(Event = str_replace_all(Event, regex("^wild.*?fire.*"), "Wildfire")) %>%
mutate(Event = str_replace_all(Event, regex("^winter s.*"), "Winter Storm")) %>%
mutate(Event = str_replace_all(Event, regex("^winter w.*"), "Winter Weather")) %>%
mutate(Event = str_replace_all(Event, regex("^snow.*"), "Winter Weather")) %>%
mutate(Event = str_replace_all(Event, regex("^light snow$"), "Winter Weather")) %>%
mutate(Event = str_replace_all(Event, regex("^wint.*?mix"), "Winter Weather")) %>%
# Collect events ending with specific words
mutate(Event = str_replace_all(Event, regex(".*drought$"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex(".*dry$"), "Drought")) %>%
mutate(Event = str_replace_all(Event, regex(".*flooding.*"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex(".*fld.*"), "Flood")) %>%
mutate(Event = str_replace_all(Event, regex(".*high surf$"), "High Surf")) %>%
mutate(Event = str_replace_all(Event, regex(".*thunderstorm$"), "Thunderstorm Wind")) %>%
mutate(Event = str_replace_all(Event, regex(".*winter storm$"), "Winter Storm")) %>%
mutate(Event = str_replace_all(Event, regex(".*winter weather$"), "Winter Weather")) %>%
mutate(Event = str_to_title(Event)) %>%
as.data.frame()
For a preview of prepared data, we now see
- 977 EVTYPE is reduced to 317 Events with major event categorized ready
- all 48 event categorized and shows up
storm_event %>% count(Event, sort = TRUE) %>% tibble()
## # A tibble: 317 x 2
## Event n
## <chr> <int>
## 1 Thunderstorm Wind 324815
## 2 Hail 288829
## 3 Tornado 60687
## 4 Flash Flood 55038
## 5 Flood 29876
## 6 High Wind 22252
## 7 Heavy Snow 15843
## 8 Lightning 15766
## 9 Marine Thunderstorm Wind 11987
## 10 Heavy Rain 11801
## # ... with 307 more rows
df <- merge(evnt_tbl, storm_event)
df %>% count(Event, sort = TRUE)
## Event n
## 1 Thunderstorm Wind 324815
## 2 Hail 288829
## 3 Tornado 60687
## 4 Flash Flood 55038
## 5 Flood 29876
## 6 High Wind 22252
## 7 Heavy Snow 15843
## 8 Lightning 15766
## 9 Marine Thunderstorm Wind 11987
## 10 Heavy Rain 11801
## 11 Winter Storm 11439
## 12 Winter Weather 9233
## 13 Funnel Cloud 6980
## 14 Wildfire 4231
## 15 Waterspout 3860
## 16 Strong Wind 3775
## 17 Drought 2802
## 18 Blizzard 2728
## 19 Ice Storm 2094
## 20 Extreme Cold/Wind Chill 1894
## 21 Frost/Freeze 1831
## 22 Excessive Heat 1678
## 23 Dense Fog 1293
## 24 High Surf 968
## 25 Heat 956
## 26 Coastal Flood 863
## 27 Rip Current 777
## 28 Cold/Wind Chill 729
## 29 Tropical Storm 697
## 30 Lake-Effect Snow 659
## 31 Debris Flow 630
## 32 Freezing Fog 539
## 33 Marine Hail 442
## 34 Dust Storm 427
## 35 Storm Surge/Tide 409
## 36 Avalanche 387
## 37 Hurricane (Typhoon) 299
## 38 Astronomical Low Tide 174
## 39 Dust Devil 151
## 40 Marine High Wind 135
## 41 Sleet 78
## 42 Tropical Depression 60
## 43 Marine Strong Wind 48
## 44 Volcanic Ash 29
## 45 Lakeshore Flood 23
## 46 Seiche 21
## 47 Tsunami 20
## 48 Dense Smoke 10
While looking into economic consequences, we noted PROPDMG * PROPDMGEXP and CROPDMG * CROPDMGEXP are two pairs.
The EXP seems have stored the amount explanations like hundred, thousand, million and billion.
Therefore next transformation is just for monetary conversion.
For number 0-8, unfortunately there isn’t much explanation we could find in the document, therefore they are treated as 1 to our result. i.e. no effects
unique(df$PROPDMGEXP)
## [1] "K" NA "M" "?" "0" "7" "4" "5" "1" "B" "+" "2" "8" "H" "m" "-" "3" "6" "h"
unique(df$CROPDMGEXP)
## [1] "K" NA "M" "B" "0" "k" "?" "m" "2"
df <- df %>%
mutate(BgnDate = as_date(BGN_DATE, format = "%m/%d/%Y")) %>%
mutate(EndDate = as_date(END_DATE, format = "%m/%d/%Y")) %>%
mutate(BgnYear = year(BgnDate)) %>%
mutate(EndYear = year(EndDate)) %>%
mutate(PropDmgAmt =
case_when(PROPDMGEXP %in% c("H", "h") ~ PROPDMG * 10^2,
PROPDMGEXP %in% c("K", "k") ~ PROPDMG * 10^3,
PROPDMGEXP %in% c("M", "m") ~ PROPDMG * 10^6,
PROPDMGEXP %in% c("B", "b") ~ PROPDMG * 10^9,
TRUE ~ PROPDMG)
) %>%
mutate(CropDmgAmt =
case_when(CROPDMGEXP %in% c("H", "h") ~ CROPDMG * 10^2,
CROPDMGEXP %in% c("K", "k") ~ CROPDMG * 10^3,
CROPDMGEXP %in% c("M", "m") ~ CROPDMG * 10^6,
CROPDMGEXP %in% c("B", "b") ~ CROPDMG * 10^9,
TRUE ~ CROPDMG)
) %>%
mutate(TotalDmgAmt = PropDmgAmt + CropDmgAmt)
To have quick result first we select top 10 events out of 48, sorting by the injuries sum.
We see tornado has way more harms to population than the following rest 9 events.
df %>% count(STATE, sort = TRUE) %>% tibble()
## # A tibble: 72 x 2
## STATE n
## <chr> <int>
## 1 TX 83689
## 2 KS 53419
## 3 OK 46726
## 4 MO 35632
## 5 IA 30967
## 6 NE 30249
## 7 IL 28458
## 8 AR 27095
## 9 NC 25307
## 10 GA 25234
## # ... with 62 more rows
p1 <- df %>%
group_by(Event) %>%
summarise(
Fatalities.Sum = sum(FATALITIES, na.rm = TRUE),
Injuries.Sum = sum(INJURIES, na.rm = TRUE)
) %>%
top_n(10) %>%
melt(id = "Event")
ggplot(data = p1, aes(x = reorder(Event, value), y = value, fill = variable)) +
geom_bar(position = "dodge", stat = "identity") + coord_flip() +
theme_minimal()
To explore in more detail, a tree map is plotted with 3 levels:
1. Designator C = County | Z - Zone | M - Marine 2. All 48 events 3. Number of injuries | fatalities Thus in each level, area represents their harmed population accordingly.
p2s <- df %>%
group_by(Designator, Event) %>%
summarise(
Fatalities.Sum = sum(FATALITIES, na.rm = TRUE),
Injuries.Sum = sum(INJURIES, na.rm = TRUE)
) %>%
top_n(10) %>%
melt(id = c("Designator", "Event"))
p2 <- treemap(p2s,
index = c("Designator", "Event", "variable"),
vSize = "value",
type = "index",
palette = "Set2",
bg.labels = c("white"),
align.labels = list(
c("center", "center"),
c("right", "bottom")
)
)
d3tree2(p2, rootname = "United States 1950 - 2011 number of injuries and fatalities by event (C - County | Z - Zone | M - Marine)")
For quick result top 10 events out of 48 shows, we can see flood and hurricane are the top 2 in damage to property wherease drought is the top 1 in damage to crop.
p3 <- df %>%
group_by(Event) %>%
summarise(
PropDmgAmt.Sum = sum(PropDmgAmt, na.rm = TRUE),
CropDmgAmt.Sum = sum(CropDmgAmt, na.rm = TRUE)
) %>%
top_n(10) %>%
melt(id = "Event")
ggplot(data = p3, aes(x = reorder(Event, value), y = value, fill = variable)) +
geom_bar(position = "dodge", stat = "identity") + coord_flip() +
theme_minimal()
Same again, a tree map is plotted with 3 levels:
1. Designator C = County | Z - Zone | M - Marine
2. All 48 events 3. Damage to property | crops Thus in each level, area represents their damaged monetary amount.
p4s <- df %>%
group_by(Designator, Event) %>%
summarise(
PropDmgAmt.Sum = sum(PropDmgAmt, na.rm = TRUE),
CropDmgAmt.Sum = sum(CropDmgAmt, na.rm = TRUE)
) %>%
top_n(10) %>%
melt(id = c("Designator", "Event"))
p4 <- treemap(p4s,
index = c("Designator", "Event", "variable"),
vSize = "value",
type = "index",
palette = "Set2",
bg.labels = c("white"),
align.labels = list(
c("center", "center"),
c("right", "bottom")
)
)
d3tree2(p4, rootname = "United States 1950 - 2011 property and crop damage by event (C - County | Z - Zone | M - Marine)")
We noted the fact, events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
So in order to overview total damage occur over years to crossing the United States, 80 top result are selected from all the states in each years. Then bubble plot representing the damage amount can be made to US map, run against the year timestamp.
lonlat <- mutate_geocode(state_tbl, StateName)
p5s <- df %>%
group_by(BgnYear, STATE) %>%
slice(which.max(TotalDmgAmt)) %>%
ungroup() %>%
top_n(80)
p5ll <- merge(p5s, lonlat)
During the time sequence is played, it is easy to spot the color in yellow (represented the highest damage) occurred in 2006, where over 115 billion damage is recorded in California by a biggest ever flood, below is its remark for further understandings.
p5 <- ggmap(get_map(location = "USA", zoom = 4), darken = .3,
base_layer = ggplot(data = p5ll, aes(x = lon, y = lat, frame = BgnYear, ids = STATE))) +
geom_point(data = p5ll, aes(color = TotalDmgAmt, size = TotalDmgAmt, alpha = .7)) +
scale_size(range = c(4, 14)) +
scale_color_viridis_c()
ggplotly(p5)
p5ll %>% slice_max(TotalDmgAmt)
## STATE Event Designator STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 1 CA Flood C 6 1/1/2006 0:00:00 12:00:00 AM PST 55
## COUNTYNAME EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME
## 1 NAPA FLOOD 0 <NA> COUNTYWIDE 1/1/2006 0:00:00 07:00:00 AM
## COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG
## 1 0 NA 0 <NA> COUNTYWIDE 0 0 NA 0
## FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO
## 1 0 0 115 B 32.5 M MTR
## STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 1 CALIFORNIA, Western <NA> 3828 12218 3828 12218
## REMARKS
## 1 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## REFNUM BgnDate EndDate BgnYear EndYear PropDmgAmt CropDmgAmt
## 1 605943 2006-01-01 2006-01-01 2006 2006 1.15e+11 32500000
## TotalDmgAmt StateName Abbreviation lon lat
## 1 115032500000 California Calif. -119.4179 36.77826
In this project, we accomplished answering the two initial questions. Which types of events are most harmful to population health? - Tornado Which types of events have the greatest economic consequences? - Flood
Not only that we also made a deep dive into time across geolocation, and found the biggest factor to our result typically lying under specific locaiton by specific time.
Futher reading of the story, here is Northern California Floods from NASA
.