This report estimates the average health harm and economic damage caused by different extreme weather events in the USA. The data set used for the investigation is the US National Oceanic and Atmospheric Administration's storm database (see the related storm data documentation).
The most harmful weather event for human health is the F5 tornado (F5 denotes the tornado category according to the Fujita scale (F-Scale)). An average F5 tornado causes about one hundred injuries, of which 9% are fatal. The biggest economic consequences are caused by hurricanes: an average hurricane causes over 400 million dollars of economic damage. The F5 tornado takes second place here, with over 130 million dollars of economic damage. All economic impact values are expressed in 2016 US dollar buying-power equivalents.
The storm database mentioned above is used for the analysis (last successful download: Feb 20th, 2017, 9:20 EET). Several steps are performed to preprocess and analyse the data set:
The storm database is a large .csv file compressed into a .bz2 archive. It contains 37 variables and nearly a million records. The whole data set loaded into R takes over 400 MB of RAM.
There is no need to load all variables into memory. The ones needed are:
BGN_DATE: The date when a given weather event (EVTYPE variable) began. The year of each BGN_DATE record is used to select the annual Consumer Price Index (CPI) needed to convert the economic damage of that year into its 2016 equivalent.
EVTYPE: Weather event type.
F: Tornado category according to the Fujita scale.
FATALITIES: Deaths caused by a given EVTYPE.
INJURIES: Injuries caused by a given EVTYPE.
PROPDMG: Property damage caused by a given EVTYPE. The actual value depends on the PROPDMGEXP value.
PROPDMGEXP: Determines the PROPDMG absolute value (in most cases), i.e.
PropertyDamage = PROPDMG * 10^PROPDMGEXP.
CROPDMG: Crop damage caused by a given EVTYPE. The actual value depends on the CROPDMGEXP value.
CROPDMGEXP: Determines the CROPDMG absolute value (in most cases), i.e.
CropDamage = CROPDMG * 10^CROPDMGEXP.
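For illustration, two hypothetical records decoded with this formula (the "K" code stands for thousand and is handled in the processing step further down):
## Hypothetical values, used only to illustrate the encoding above
2.5 * 10^3 # PROPDMG = 2.5, PROPDMGEXP = "3" ->  2,500 dollars
25 * 10^3  # PROPDMG = 25,  PROPDMGEXP = "K" -> 25,000 dollars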
storm_data <-
read.table(
file = "repdata_data_StormData.csv.bz2",
header = TRUE,
sep = ",",
na.strings = NULL,
colClasses = c(
"NULL", "character", rep("NULL", 5), "factor", rep("NULL", 12),
"factor", "NULL", rep("numeric", 3), "factor", "numeric", "factor",
rep("NULL", 9)
)
)
str(storm_data)
## 'data.frame': 902297 obs. of 9 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ F : Factor w/ 7 levels "","0","1","2",..: 5 4 4 4 4 4 4 3 5 5 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
There are 48 official event types according to section 2.1.1 Storm Data Event Table of the Storm Data Preparation document linked at the beginning (last successful download: Feb 20th, 2017, 9:20 EET). However, the storm_data data frame contains
library(dplyr)
unique(storm_data$EVTYPE) %>% length()
## [1] 985
unique event types, which are essentially:
typos;
official event types with additional text added;
several (two or three) official event types recorded in a single row;
event types that are not listed among the official event types.
unique(storm_data$EVTYPE) %>% head(20)
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## 985 Levels: ? ABNORMALLY DRY ABNORMALLY WET ... WND
This report analyses only the official event types.
## Create vector which lists all official event types
official_events <-
factor(
c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
"Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
"Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
"Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain",
"Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)",
"Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
"Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
)
The task of the Stage 2 section is to identify as many official event types as possible. Here is the vector of all unique event types (called unique_events, as opposed to the official_events vector):
## The vector will be reduced each time some of its events are filtered out,
## step by step. So at the end of stage 2 the vector will be shorter than at
## the beginning.
unique_events <- unique(storm_data$EVTYPE)
First of all, some official event types may match some of the unique event types (almost) exactly:
matched_events <-
sapply(
## Every official event is surrounded by regex (i.e. forms a regex)
paste(
## Unique event name may begin with non character symbols
"^[^a-z]*",
official_events,
## After unique event name there may be (in the presented order):
## - Word ending 's', 'es' or 'ing'.
## - Any quantity of non character symbols: spaces in most cases,
## but there also may be '-', '/' etc.
## - Any single letter: some unique events are marked with a single
## letter.
## - Any quantity of non character symbols: in most cases these are
## numbers.
## - "mph": this concerns to thunderstorm wind event.
## - Any quantity of non character symbols: in most cases this is ')'.
"(s|es|ing)?[^a-z]*[a-z]?[^a-z]*(mph)?[^a-z]*$",
sep = ""
),
FUN = grepl,
unique_events,
ignore.case = TRUE
)
dim(matched_events)
## [1] 985 48
The last code snippet produced a logical matrix where each row represents a single unique event type and each column a single official event type. TRUE values in this matrix mean that a given unique event type (row) matched a given official event type (column).
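As a quick sanity check (a small sketch using the objects created above), we can look up which official event type the first unique event type matched; it should flag only Tornado:
## Which official event type(s) does the first unique event type match?
as.character(unique_events[1])
as.character(official_events[matched_events[1, ]])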
Let's see if there are any matches:
## Iterate through rows (unique event types) to find matching with official
## event type
apply(matched_events, MARGIN = 1, FUN = function(matrix_row) any(matrix_row)) %>%
## Retrieve only TRUEs
.[.] %>%
length()
## [1] 160
Now the matched values need to be retrieved. Let's create a data frame which will contain the unique event types that matched official ones:
## Data frame which contains sorted unique event types related to
## official event types
sorted_events <-
data.frame(
official_event = factor(levels = official_events),
unique_event = factor(levels = levels(storm_data$EVTYPE)),
## A lot of unique event types consist of several official event types
## (e.g. unique event 'ICE STORM/FLASH FLOOD' consists of two
## official event types: Ice Storm and Flash Flood).
## The variable declared describes how many official events are included
## in a given unique_event. This value will serve as damage and injuries
## divider to split these values among official event types
events_involved = integer(),
stringsAsFactors = FALSE
)
Let's fill sorted_events with the matched values:
library(magrittr)
## Iterate through 'matched_events' matrix's rows to fill in 'sorted_events'
## data frame with exactly matched unique event names
for (i in 1:nrow(matched_events))
{
## Index of 'official_events' vector which matched with record from
## 'unique_events'
occurence_index <- match(TRUE, table = matched_events[i, ])
## If there is TRUE value in a given matrix's row
if (!is.na(occurence_index))
{
sorted_events %<>%
summarise(
official_event = official_events[occurence_index],
unique_event = unique_events[i],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
}
}
## What does the 'sorted_events' data frame look like?
head(sorted_events)
## official_event unique_event events_involved
## 1 Tornado TORNADO 1
## 2 Hail HAIL 1
## 3 Winter Storm WINTER STORM 1
## 4 Thunderstorm Wind THUNDERSTORM WINDS 1
## 5 Heavy Rain HEAVY RAIN 1
## 6 Lightning LIGHTNING 1
## Remove unique events that are already sorted out to 'sorted_events'
## data frame
unique_events <-
unique_events[
!apply(
matched_events,
MARGIN = 1,
FUN = function(matrix_row) any(matrix_row)
)
]
## How many unsorted unique events remain?
length(unique_events)
## [1] 825
As mentioned above, there are unique events which contain several (two or three) official events recorded in a single row. Let's find the unique event types which contain two events at once; a minimal illustration of the pattern is shown below, followed by the full search:
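Here the unique event type ICE STORM/FLASH FLOOD (known to be present in the data) is checked against a hand-built instance of the two-event pattern assembled in the loop below:
## Does 'ICE STORM/FLASH FLOOD' match the Ice Storm + Flash Flood pattern?
grepl(
  "^[^a-z]*Ice Storm(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*Flash Flood(s|es|ing)?[^a-z]*$",
  "ICE STORM/FLASH FLOOD",
  ignore.case = TRUE
)
## [1] TRUE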
for (i in seq_along(official_events))
{
matched_events <-
sapply(
## Two official event types are surrounded by regex
paste(
## Unique event type may begin with any non character signs
## quantity
"^[^a-z]*",
official_events[i],
## - First official event type may end with 's', 'es' or 'ing'.
## - There must be '/', '&' or 'and' between two
## official event types.
## - There also may be any quantity of non character signs.
"(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*",
official_events,
## - Second official event type may end with 's', 'es' or 'ing'.
## - Unique event type may end with any non character signs
## quantity.
"(s|es|ing)?[^a-z]*$",
sep = ""
),
FUN = grepl,
unique_events,
ignore.case = TRUE
)
## 'unique_events' vector indices which have matched during a given
## loop iteration.
occurence_indices <- integer()
## Find matched patterns among 'unique_events' vector
for (matrix_row in 1:nrow(matched_events))
{
occurence_index <- match(TRUE, table = matched_events[matrix_row, ])
## If there is an index present in a given vector element
if (!is.na(occurence_index))
{
## If same official event type is written twice in a given
## unique event type
if (occurence_index == i)
{
warning("Following unique event type consists of the same",
" official event type, ", official_events[i],
", written twice:\n", unique_events[matrix_row])
next
}
occurence_indices[length(occurence_indices) + 1] <- matrix_row
## Add both official event types involved in unique event type
sorted_events %<>%
summarise(
official_event = official_events[i],
unique_event = unique_events[matrix_row],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[occurence_index],
unique_event = unique_events[matrix_row],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched elements that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
}
## How many sorted events now? This time each matched unique event adds
## two records to the 'sorted_events' data frame. That's because we've found
## unique events that contain two official events in a single record
nrow(sorted_events)
## [1] 274
## What do the newly added rows look like?
tail(sorted_events)
## official_event unique_event events_involved
## 269 Waterspout WATERSPOUT/ TORNADO 2
## 270 Tornado WATERSPOUT/ TORNADO 2
## 271 Winter Storm WINTER STORM/HIGH WIND 2
## 272 High Wind WINTER STORM/HIGH WIND 2
## 273 Winter Storm WINTER STORM/HIGH WINDS 2
## 274 High Wind WINTER STORM/HIGH WINDS 2
## How many unsorted unique events remain?
length(unique_events)
## [1] 768
In the previous step we searched for unique event types which contain two events at once. In this step we try to find three official events in a single unique event:
for (event1 in seq_along(official_events))
{
for (event2 in seq_along(official_events))
{
matched_events <-
sapply(
## Three official event types are surrounded by regex
paste(
## Unique event type may begin with any non character signs
## quantity
"^[^a-z]*",
official_events[event1],
## - First official event type may end with 's', 'es' or
## 'ing'.
## - There must be '/', '&' or 'and' between first and
## second official event type.
## - There also may be any quantity of non character signs.
"(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*",
official_events[event2],
## - Second official event type may end with 's', 'es' or
## 'ing'.
## - There must be '/', '&' or 'and' between second and
## third official event type.
## - There also may be any quantity of non character signs.
"(s|es|ing)?[^a-z]*(\\/|&|\\band\\b)[^a-z]*",
official_events,
## - Third official event type may end with 's', 'es' or
## 'ing'.
## - Unique event type may end with any non character signs
## quantity.
"(s|es|ing)?[^a-z]*$",
sep = ""
),
FUN = grepl,
unique_events,
ignore.case = TRUE
)
## 'unique_events' vector indices which have matched during a given
## loop iteration.
occurence_indices <- integer()
## Find matched patterns among 'unique_events' vector
for (matrix_row in 1:nrow(matched_events))
{
occurence_index <- match(TRUE, table = matched_events[matrix_row, ])
## If there is an index present in a given vector element
if (!is.na(occurence_index))
{
## If same official event type is written twice or three times
## in a given unique event type
if (occurence_index == event1 ||
occurence_index == event2 ||
event1 == event2)
{
warning("Following unique event type consists of the same",
" official event types written twice or three times:\n",
unique_events[matrix_row])
next
}
occurence_indices[length(occurence_indices) + 1] <- matrix_row
## Add all three official event types involved in
## unique event type
sorted_events %<>%
summarise(
official_event = official_events[event1],
unique_event = unique_events[matrix_row],
events_involved = 3
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[event2],
unique_event = unique_events[matrix_row],
events_involved = 3
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[occurence_index],
unique_event = unique_events[matrix_row],
events_involved = 3
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched elements that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
}
}
## How many sorted events now? This time each matched unique event adds
## three records to the 'sorted_events' data frame. That's because we've found
## unique events that contain three official events in a single record
nrow(sorted_events)
## [1] 280
## What do the newly added rows look like?
tail(sorted_events)
## official_event unique_event events_involved
## 275 Heavy Snow HEAVY SNOW/BLIZZARD/AVALANCHE 3
## 276 Blizzard HEAVY SNOW/BLIZZARD/AVALANCHE 3
## 277 Avalanche HEAVY SNOW/BLIZZARD/AVALANCHE 3
## 278 Heavy Snow HEAVY SNOW/HIGH WINDS & FLOOD 3
## 279 High Wind HEAVY SNOW/HIGH WINDS & FLOOD 3
## 280 Flood HEAVY SNOW/HIGH WINDS & FLOOD 3
## How many unsorted unique events remain?
length(unique_events)
## [1] 766
During this step we'll investigate the official events and find unique words among them, i.e. we'll search for words that occur only once in the whole official_events vector. These words will then be searched for in the unique_events elements. Additionally, we'll mark the words that are duplicated within the official_events vector. So the unique events we are after should contain exactly one unique word and no duplicated words; a toy illustration of the idea follows.
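Here is the toy illustration on three official event names (a demonstration only, not part of the pipeline):
## Split a few official event names into words and count the occurrences
table(unlist(strsplit(tolower(c("Coastal Flood", "Flash Flood", "Dense Fog")), "[^a-z]+")))
## 'coastal', 'dense', 'flash' and 'fog' occur once, so each identifies its
## official event on its own; 'flood' occurs twice and cannot.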
Looking ahead, there will be three false positives: Icestorm/Blizzard, EXTREMELY WET and RECORD LOW RAINFALL. In order not to mess up sorted_events, let's handle them beforehand:
## At this cleaning step there is searching for unique words from official
## event types performed. "Icestorm/Blizzard" unique event type is found during
## mentioned search. But it's not appropriate to be classified as single
## official event type because it contains two events, "Ice Storm" and "Blizzard".
## That's why it needs to be classified manually before performing mentioned
## cleaning stage
sorted_events %<>%
summarise(
official_event = official_events[official_events == "Ice Storm"],
unique_event = unique_events[unique_events == "Icestorm/Blizzard"],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
sorted_events %<>%
summarise(
official_event = official_events[official_events == "Blizzard"],
unique_event = unique_events[unique_events == "Icestorm/Blizzard"],
events_involved = 2
) %>%
bind_rows(sorted_events, .)
## The 'RECORD LOW RAINFALL' unique event would be classified as the
## 'Heavy Rain' official event. Actually it's the opposite - 'Drought' :)
sorted_events %<>%
summarise(
official_event = official_events[official_events == "Drought"],
unique_event = unique_events[unique_events == "RECORD LOW RAINFALL"],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
unique_events <-
unique_events[
## Also there is "EXTREMELY WET" unique event type classified as
## "Extreme Cold/Wind Chill" official event type. Nevertheless there is
## no official event type which can be classified as "EXTREMELY WET",
## hence it should be ignored
unique_events != "EXTREMELY WET" &
unique_events != "Icestorm/Blizzard" &
unique_events != "RECORD LOW RAINFALL"
]
Now let's perform the search:
## All words contained in 'official_events' vector (duplicates included)
official_events_all_words <-
strsplit(
as.character(official_events) %>% tolower(),
## Split by noncharacter symbols
split = "[^a-zA-Z]"
) %>%
unlist() %>%
## There will be an empty string produced which is not needed
.[. != ""]
head(official_events_all_words)
## [1] "astronomical" "low" "tide" "avalanche"
## [5] "blizzard" "coastal"
nonunique_words <-
official_events_all_words[duplicated(official_events_all_words)]
## Unique words that have no duplicates (i.e. have single occurrence in
## 'official_events_all_words')
official_events_unique_words <-
official_events_all_words[!(official_events_all_words %in% nonunique_words)] %>%
## - Unique word "freezing" is inappropriate in the context of
## official events unique words. Mentioned word implies
## "Freezing Fog" collocation, but such word combination is already
## filtered when searching for exactly matching patterns between
## official event types and unique event types.
## - Unique words "ice" and "weather" imply "Ice Storm" and "Winter Weather"
## official event types accordingly. But mentioned unique words are used
## in unique event types to denote not only "Ice Storm" and "Winter Weather"
## events, hence they cannot be used for unique words searching.
## - Unique word "excessive" implies "Excessive Heat" official event type.
## But this word is used as a common adjective in unique event types.
## Note. There is also "extreme" adjective which implies
## "Extreme Cold/Wind Chill" official event type. But it seems that when
## searching in unique event types for single unique word occurrence AND
## nonunique words absence, "extreme" is occurs only for its
## official event type
.[. != "freezing" & . != "ice" & . != "weather" & . != "excessive"]
head(official_events_unique_words)
## [1] "astronomical" "low" "avalanche" "blizzard"
## [5] "coastal" "debris"
## Nonunique words that have duplicates (i.e. have multiple occurrences in
## 'official_events_all_words')
official_events_nonunique_words <-
official_events_all_words[official_events_all_words %in% nonunique_words] %>%
unique()
head(official_events_nonunique_words)
## [1] "tide" "flood" "cold" "wind" "chill" "dense"
## Shows which unique words belong to which official event types
unique_words_correspondence_table <-
data.frame(
unique_word = official_events_unique_words,
official_event_type =
factor(
character(length = length(official_events_unique_words)),
levels = levels(official_events)
)
)
## The matrix which shows at which index of 'official_events' vector
## a given unique word occurs. Columns represent unique words from
## official events while rows represent indices from 'official_events'.
unique_words_in_official_events <-
sapply(
paste("\\b", official_events_unique_words, "\\b", sep = ""),
FUN = grepl,
official_events,
ignore.case = TRUE
)
## Fill 'unique_words_correspondence_table' data frame's
## 'official_event_type' variable
for (i in 1:ncol(unique_words_in_official_events))
{
## Index of element corresponding to unique word
official_event_type_index <-
match(TRUE, unique_words_in_official_events[, i])
unique_words_correspondence_table[i, ]$official_event_type <-
official_events[official_event_type_index]
}
## Retain only one unique word for a given official event type
## (i.e. official event type can have only one unique word as a representative)
unique_words_correspondence_table <-
filter(
.data = unique_words_correspondence_table,
!duplicated(official_event_type)
)
## What does the created data frame look like?
head(unique_words_correspondence_table)
## unique_word official_event_type
## 1 astronomical Astronomical Low Tide
## 2 avalanche Avalanche
## 3 blizzard Blizzard
## 4 coastal Coastal Flood
## 5 debris Debris Flow
## 6 smoke Dense Smoke
nrow(unique_words_correspondence_table)
## [1] 28
## The matrix which shows where official events unique words occur in
## all unique events vector. Columns represent each unique word while rows
## represent each of total unique events
unique_words_in_unique_events <-
sapply(
## Here unique words are also matched as parts of longer words,
## not only as distinct words
unique_words_correspondence_table$unique_word,
FUN = grepl,
unique_events,
ignore.case = TRUE
)
dim(unique_words_in_unique_events)
## [1] 763 28
## Logical vector describing whether a given 'unique_events' element contains
## nonunique word from official event types vector or not
nonunique_word_in_unique_event <-
sapply(
paste("\\b", official_events_nonunique_words, "\\b", sep = ""),
FUN = grepl,
unique_events,
ignore.case = TRUE
) %>%
apply(
MARGIN = 1,
FUN = function(matrix_row) any(matrix_row[matrix_row])
)
length(nonunique_word_in_unique_event)
## [1] 763
## 'unique_events' vector indices which are perceived as matched during
## official event types unique words searching
occurence_indices <- integer()
## Find unique event types which contain single unique word AND do not contain
## nonunique words from official event types
for (i in seq_along(unique_events))
{
unique_words_quantity <-
unique_words_in_unique_events[i, ][unique_words_in_unique_events[i, ]] %>%
length()
## If there is single official event types unique word present in
## a unique event type AND there are no official event types word duplicates
## present
if (unique_words_quantity == 1 & !nonunique_word_in_unique_event[i])
{
occurence_indices[length(occurence_indices) + 1] <- i
unique_word_index <- match(TRUE, unique_words_in_unique_events[i, ])
sorted_events %<>%
summarise(
official_event =
unique_words_correspondence_table[unique_word_index, ]$official_event_type,
unique_event = unique_events[i],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched unique event types that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
## How many sorted events now? This time each matched unique event adds
## single record to 'sorted_events' data frame.
nrow(sorted_events)
## [1] 342
## What do the newly added rows look like?
tail(sorted_events)
## official_event unique_event events_involved
## 337 Extreme Cold/Wind Chill EXTREME WINDCHILL TEMPERATURES 1
## 338 Heavy Rain LIGHT FREEZING RAIN 1
## 339 Dense Smoke SMOKE 1
## 340 High Surf HAZARDOUS SURF 1
## 341 Hurricane (Typhoon) HURRICANE/TYPHOON 1
## 342 Volcanic Ash VOLCANIC ASHFALL 1
## How many unsorted unique events remain?
length(unique_events)
## [1] 704
Finally, let's find matches between official and unique event types using the amatch() function from the stringdist package:
library(stringdist)
## Bigger 'maxDist' values will give false positives
matched_events <- amatch(unique_events, table = official_events, maxDist = 2)
## 'unique_events' vector indices which matched during the approximate
## string matching with official event types
occurence_indices <- integer()
for (i in seq_along(matched_events))
{
## If unique event type has matched with official event type
if (!is.na(matched_events[i]))
{
occurence_indices[length(occurence_indices) + 1] <- i
sorted_events %<>%
summarise(
official_event = official_events[matched_events[i]],
unique_event = unique_events[i],
events_involved = 1
) %>%
bind_rows(sorted_events, .)
}
}
## If there are matched unique event types that need to be removed
if (length(occurence_indices))
{
unique_events <- unique_events[-c(occurence_indices)]
}
## How many sorted events now?
nrow(sorted_events)
## [1] 343
## What does the newly added row look like?
tail(sorted_events, n = 1)
## official_event unique_event events_involved
## 343 Lake-Effect Snow Lake Effect Snow 1
## How many unsorted unique events remain?
length(unique_events)
## [1] 703
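For reference, this is how amatch() treats the single match found above (a standalone illustration): Lake Effect Snow lies within edit distance 2 of the official Lake-Effect Snow, so it is matched, while anything farther away would return NA.
## 'Lake Effect Snow' differs from 'Lake-Effect Snow' by one character
amatch("Lake Effect Snow", table = "Lake-Effect Snow", maxDist = 2)
## [1] 1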
Cleaning summary.
The searches performed above helped us classify 29%
((unique(sorted_events$unique_event) %>% length()) /
(unique(storm_data$EVTYPE) %>% length())) %>%
round(digits = 2)
## [1] 0.29
of the unique event types as official event types. Nevertheless, this share covers 73%
((storm_data$EVTYPE %in% unique(sorted_events$unique_event) %>% .[.] %>% length()) /
nrow(storm_data)) %>%
round(digits = 2)
## [1] 0.73
of the storm data records.
## Apply obtained results to 'storm_data' data frame: retain only records that
## have determined events
storm_data %<>% filter(EVTYPE %in% unique(sorted_events$unique_event))
nrow(storm_data)
## [1] 661472
At this stage the cleaned data needs to be evaluated and distributed among the official event types. As stated in the synopsis, economic impact values are expressed in 2016 US dollar buying-power equivalents. The issue with the damage values in the storm database is that they reflect US dollar buying power in the year they were recorded, and things that cost x in 1980 cost, on average, x * 2.95 in 2017. Thus we need coefficients which convert values from all the different years to a single year's equivalent. Consumer Price Indexes (CPI) help to solve this issue; a small sketch of the idea is shown below, and the actual CPI data are downloaded right after it:
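A minimal sketch of the conversion idea, using illustrative CPI values only (approximate annual averages, not the data downloaded below):
## Illustrative annual average CPIs (approximate values)
cpi_1980 <- 82.4
cpi_2016 <- 240.0
## 1,000,000 1980 dollars expressed in 2016 buying power
1e6 * cpi_2016 / cpi_1980
## roughly 2.9 million, in line with the factor quoted above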
library(quantmod)
## Download Consumer Price Indexes (CPI) from
## St. Louis Federal Reserve Bank's FRED system. The data downloaded is
## saved as time series 'CPIAUCSL' object.
## Last successful downloading: Feb 20th, 2017, 11:00 EET
getSymbols.FRED(Symbols = "CPIAUCSL", env = .GlobalEnv)
## [1] "CPIAUCSL"
class(CPIAUCSL)
## [1] "xts" "zoo"
The downloaded data is a time series object which represents monthly CPIs beginning from
first(CPIAUCSL)
## CPIAUCSL
## 1947-01-01 21.48
Since the storm data set contains values beginning from
head(storm_data$BGN_DATE, n = 1)
## [1] "4/18/1950 0:00:00"
the downloaded CPIs are appropriate for our purpose. The CPIAUCSL time series contains monthly data:
first(CPIAUCSL, n = 5)
## CPIAUCSL
## 1947-01-01 21.48
## 1947-02-01 21.62
## 1947-03-01 22.00
## 1947-04-01 22.00
## 1947-05-01 21.95
This report focuses on yearly CPI values:
library(lubridate)
## A data frame which contains yearly CPIs.
CPIs <-
apply.yearly(CPIAUCSL, FUN = mean) %>%
data.frame(year(.) %>% as.factor())
colnames(CPIs) <- c("CPI", "year")
rownames(CPIs) <- NULL
head(CPIs)
## CPI year
## 1 22.33167 1947
## 2 24.04500 1948
## 3 23.80917 1949
## 4 24.06250 1950
## 5 25.97333 1951
## 6 26.56667 1952
## Retain only year values for event date instead of whole date
storm_data$BGN_DATE %<>%
mdy_hms() %>%
year() %>%
factor(levels = levels(CPIs$year))
storm_data %<>% rename(year = BGN_DATE)
head(storm_data)
## year EVTYPE F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 1950 TORNADO 3 0 15 25.0 K 0
## 2 1950 TORNADO 2 0 0 2.5 K 0
## 3 1951 TORNADO 2 0 2 25.0 K 0
## 4 1951 TORNADO 2 0 2 2.5 K 0
## 5 1951 TORNADO 2 0 2 2.5 K 0
## 6 1951 TORNADO 2 0 6 2.5 K 0
Now we have all the data sets necessary to calculate and distribute injury and damage values among the official event types:
storm_data contains the raw values distributed among unique event types:
head(storm_data)
## year EVTYPE F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 1950 TORNADO 3 0 15 25.0 K 0
## 2 1950 TORNADO 2 0 0 2.5 K 0
## 3 1951 TORNADO 2 0 2 25.0 K 0
## 4 1951 TORNADO 2 0 2 2.5 K 0
## 5 1951 TORNADO 2 0 2 2.5 K 0
## 6 1951 TORNADO 2 0 6 2.5 K 0
sorted_events is a correspondence table which binds official event types to the related unique event types and records how many official events are contained in a single unique event record:
head(sorted_events)
## official_event unique_event events_involved
## 1 Tornado TORNADO 1
## 2 Hail HAIL 1
## 3 Winter Storm WINTER STORM 1
## 4 Thunderstorm Wind THUNDERSTORM WINDS 1
## 5 Heavy Rain HEAVY RAIN 1
## 6 Lightning LIGHTNING 1
CPIs contains the yearly Consumer Price Indexes:
head(CPIs)
## CPI year
## 1 22.33167 1947
## 2 24.04500 1948
## 3 23.80917 1949
## 4 24.06250 1950
## 5 25.97333 1951
## 6 26.56667 1952
Let's join them together:
storm_data %<>%
## Attach CPIs to storm data set by 'year' variable.
left_join(y = CPIs) %>%
## Combine sorted event types with storm data set to determine how many
## official events involved in a given 'EVTYPE'
left_join(y = sorted_events, by = c("EVTYPE" = "unique_event")) %>%
## Retain only official event type names and remove 'EVTYPE' variable
select(year, official_event, F, FATALITIES:events_involved)
head(storm_data)
## year official_event F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 1950 Tornado 3 0 15 25.0 K 0
## 2 1950 Tornado 2 0 0 2.5 K 0
## 3 1951 Tornado 2 0 2 25.0 K 0
## 4 1951 Tornado 2 0 2 2.5 K 0
## 5 1951 Tornado 2 0 2 2.5 K 0
## 6 1951 Tornado 2 0 6 2.5 K 0
## CROPDMGEXP CPI events_involved
## 1 24.06250 1
## 2 24.06250 1
## 3 25.97333 1
## 4 25.97333 1
## 5 25.97333 1
## 6 25.97333 1
The PROPDMGEXP and CROPDMGEXP variables represent exponents for PROPDMG and CROPDMG respectively. E.g. the final value of a given PROPDMG variable will be
PROPDMG * 10^PROPDMGEXP
This is true only for cases where the PROPDMGEXP value can be coerced to numeric (i.e. is a digit). There are also additional …EXP values besides plain digits: "", "-", "+" and metric prefixes (hecto, kilo, mega and "B", treated as billion):
levels(storm_data$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(storm_data$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
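A compact sketch of how a single exponent code translates into a multiplier, mirroring the rules applied inline in the mutate() call further down (a helper written only for clarity, not used by the analysis):
## Map one exponent code to its multiplier: letters are metric prefixes,
## digits are powers of ten, '', '-' and '+' mean no scaling and '?'
## contributes nothing.
exp_multiplier <- function(code) {
  prefixes <- c(h = 1e2, H = 1e2, k = 1e3, K = 1e3, m = 1e6, M = 1e6, B = 1e9)
  code <- as.character(code)
  if (code %in% names(prefixes)) return(unname(prefixes[code]))
  if (grepl("^[0-9]$", code)) return(10^as.numeric(code))
  if (code %in% c("", "-", "+")) return(1)
  0
}
exp_multiplier("K")
## [1] 1000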
To convert previous years' values to their 2016 equivalents, we need to calculate the growth factor of each previous year relative to 2016:
## Calculate the growth factor of the CPI relative to 2016. The
## coefficient uses annual mean values, so 2016 is taken as the newest
## completed year.
growth_factor <-
sapply(
storm_data$CPI,
FUN = function(x) CPIs$CPI[CPIs$year == 2016] / x
)
head(growth_factor)
## [1] 9.974410 9.974410 9.240603 9.240603 9.240603 9.240603
The final step for this stage is to obtain the real …DMG values. Each PROPDMG/CROPDMG value is calculated by the following formula:
(…DMG * growth_factor * 10^…DMGEXP) / events_involved
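As a worked example (values taken from the outputs above): the first record is a 1950 tornado with PROPDMG = 25, PROPDMGEXP = "K", a growth factor of about 9.97441 and one official event involved:
## 25 * 10^3 dollars of 1950 property damage, converted to 2016 dollars
25 * 10^3 * 9.974410 / 1
## roughly 249,360 -- matching the first PROPDMG value shown below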
storm_data %<>%
mutate(
## Final property damage formula:
## Property_damage =
## (PROPDMG * growth_factor * propertyDamageExponent) /
## officialEventsInAGivenEvent
PROPDMG = PROPDMG * growth_factor * sapply(
PROPDMGEXP,
FUN = function(x)
{
if (x == "" | x == "-" | x == "+")
{
return(1)
}
if (x == "?")
{
return(0)
}
if (x == "h" | x == "H")
{
return(10^2)
}
if (x == "K")
{
return(10^3)
}
if (x == "m" | x == "M")
{
return(10^6)
}
if (x == "B")
{
return(10^9)
}
10^as.numeric(levels(storm_data$PROPDMGEXP)[x])
}
) / events_involved,
## Final crop damage formula:
## Crop_damage =
## (CROPDMG * growth_factor * cropDamageExponent) /
## officialEventsInAGivenEvent
CROPDMG = CROPDMG * growth_factor * sapply(
CROPDMGEXP,
FUN = function(x)
{
if (x == "")
{
return(1)
}
if (x == "?")
{
return(0)
}
if (x == "k" | x == "K")
{
return(10^3)
}
if (x == "m" | x == "M")
{
return(10^6)
}
if (x == "B")
{
return(10^9)
}
10^as.numeric(levels(storm_data$CROPDMGEXP)[x])
}
) / events_involved
) %>%
## Get rid of unnecessary variables
select(-PROPDMGEXP, -CROPDMGEXP, -CPI, -events_involved)
head(storm_data)
## year official_event F FATALITIES INJURIES PROPDMG CROPDMG
## 1 1950 Tornado 3 0 15 249360.26 0
## 2 1950 Tornado 2 0 0 24936.03 0
## 3 1951 Tornado 2 0 2 231015.06 0
## 4 1951 Tornado 2 0 2 23101.51 0
## 5 1951 Tornado 2 0 2 23101.51 0
## 6 1951 Tornado 2 0 6 23101.51 0
The final stage is to analyse the obtained data and visualize it. When reading the data set we also loaded the Fujita scale tornado categories, F. That's because the consequences of a tornado may differ greatly depending on its severity: an F0 tornado may cause relatively little damage, while an F5 is a huge catastrophe. If we treated all tornadoes without this classification we would get averaged-out injury and damage values, which would mislead the whole investigation. It follows that we need to separate the different tornado categories. There are 7 levels of the F variable:
levels(storm_data$F)
## [1] "" "0" "1" "2" "3" "4" "5"
Since we don't know which category the "" level represents, and this level is rather rare
((filter(.data = storm_data, official_event == "Tornado" & F == "") %>% nrow()) /
(filter(.data = storm_data, official_event == "Tornado") %>% nrow())) %>%
round(digits = 2)
## [1] 0.03
it's better to ignore these records.
## Tornado has 6 classification types of severity, from F0 to F5. Tornadoes
## health and economic impact needs to be split among F categories
tornado_data <- filter(.data = storm_data, official_event == "Tornado" & F != "")
## Concatenate classification to each tornado event type
tornado_data$official_event <-
paste(tornado_data$official_event, " F", tornado_data$F, sep = "")
## Change all "Tornado" event types to "Tornado F[number]"
storm_data %<>%
filter(official_event != "Tornado") %>%
bind_rows(tornado_data)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
storm_data$official_event %<>% as.factor()
tail(storm_data)
## year official_event F FATALITIES INJURIES PROPDMG CROPDMG
## 660283 2011 Tornado F2 2 0 0 266768.2 0
## 660284 2011 Tornado F1 1 0 0 533536.5 0
## 660285 2011 Tornado F1 1 0 0 106707.3 0
## 660286 2011 Tornado F0 0 0 0 0.0 0
## 660287 2011 Tornado F0 0 0 0 0.0 0
## 660288 2011 Tornado F1 1 0 2 4268291.8 0
Now it's time to create the health harm and economic damage plots.
## Used to create health harm plot
health_harm_summary <-
group_by(.data = storm_data, official_event) %>%
summarise(
## Health harm average: fatalities and injuries
health_harm = mean(FATALITIES + INJURIES),
## Fatality fraction in health harm average
fatality_rate = (sum(FATALITIES) / sum(FATALITIES + INJURIES))
)
There are some NaNs produced. Let's see which variables have them:
sapply(health_harm_summary, FUN = function(variable) anyNA(variable))
## official_event health_harm fatality_rate
## FALSE FALSE TRUE
There are NaNs in the fatality_rate variable. It's better to replace them with 0s:
health_harm_summary$fatality_rate %<>%
sapply(
FUN = function(x)
{
if (is.nan(x))
{
return(0)
}
x
}
)
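An equivalent, more compact alternative (a design note only; it produces the same result) would be plain logical indexing:
## Same effect via logical indexing
health_harm_summary$fatality_rate[is.nan(health_harm_summary$fatality_rate)] <- 0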
Create health harm plot:
library(ggplot2)
library(tidyr)
health_harm_plot <-
ggplot(data = health_harm_summary) +
geom_col(
mapping = aes(
x = reorder(official_event, health_harm),
y = health_harm,
fill = fatality_rate * 100
)
) +
scale_fill_continuous(
name = "Fatality rate, %",
low = "#56B1F7",
high = "#132B43"
) +
labs(
title = "Average Health Harm (Injuries + Fatalities) for Each Event Type",
x = "Event types",
y = "Average people injured"
) +
coord_flip()
## Used to create economic damage plot
economic_damage_summary <-
group_by(.data = storm_data, official_event) %>%
summarise(
property_damage = mean(PROPDMG),
crop_damage = mean(CROPDMG),
total_damage = property_damage + crop_damage
) %>%
## Rearrange data frame to represent long data instead of wide
gather(key = damage_type, value = value, property_damage:crop_damage) %>%
arrange(desc(total_damage)) %>%
## Damage values vary from 10 to over 400,000,000 US dollars. This additional
## variable, 'facet_area', is used to split different damage values among
## facets for better visualizing
mutate(
facet_area = factor(
c(rep(1L, 12), rep(2L, 38), rep(3L, 54)),
labels = c("over 10 mln.", "100 ths. - 10 mln.", "less than 100 ths."),
)
)
Create economic damage plot:
library(scales)
economic_damage_plot <-
ggplot(data = economic_damage_summary) +
geom_col(
mapping = aes(
x = reorder(official_event, value),
y = value,
fill = damage_type
)
) +
scale_y_continuous(labels = comma) +
scale_fill_discrete(
name = "",
labels = c("Crop damage", "Property damage")
) +
facet_grid(facets = ~facet_area, scales = "free_x") +
theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
labs(
title = paste(
"Average Economic Damage (Property + Crop)",
"for Each Event Type, 2016 Year's Buying Power Equivalent",
sep = "\n"
),
x = "Event types",
y = "Average damage, US dollars"
) +
coord_flip()
The most dangerous weather events for human health are tornadoes of the F5 and F4 categories, which cause 99 and 33 injuries on average respectively. Among the people harmed, 9% of cases are fatal for Tornado F5 and 6% for Tornado F4. Tsunamis also carry high health risks, with 8 injuries on average and 20% fatal cases among the people harmed.
Another point which should not be ignored is the high fatality rates. An average recorded Sleet weather event harms only 0.03 people, yet all (100%) of those cases are fatal. Other event types with high fatality rates that deserve attention are listed below (the sketch after the list shows how these rates can be read from health_harm_summary):
Cold/Wind Chill (89% fatal cases)
Storm Surge/Tide (69% fatal cases)
Avalanche (57% fatal cases)
Rip Current (52% fatal cases)
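A quick way to read these rates, using the health_harm_summary object created above:
## Event types with the highest fatality rates
arrange(health_harm_summary, desc(fatality_rate)) %>% head()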
The following bar plot shows the average number of people harmed by the different weather event types. Color tones show the fatality rate among those harmed: darker colors mean a higher share of fatal cases.
Hurricanes cause the biggest economic damage. An average hurricane causes $402,048,114 of economic damage ($376,306,559 of property damage and $25,741,556 of crop damage). Tornadoes of the F5 and F4 categories come next, with $133,536,352 and $47,903,101 of total damage respectively.
An additional point to take into account is the high crop damage caused by Droughts ($7,572,106) and Ice Storms ($4,037,989).
The following bar plot shows the average economic damage from the different weather event types. Since the damage varies from 10 to over 400,000,000 US dollars, the values are split into three groups with different x-axis scales for better readability.