To answer the two questions posed in the task, we summed the corresponding variables in the NOAA Storm Database for each event type.
The results show that, over the whole history of observations, tornados were the most dangerous events for human life, while floods caused the most economic damage.
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
First we download the raw data from the given URL, unless the archive is already present:
# Download only once; the bzip2 archive is read as is, so no extraction step is needed
if (!file.exists('data.csv.bz2')) {
  download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'data.csv.bz2', mode = 'wb')
}
We read the data directly from the archive with readr::read_csv:
raw_data <- read_csv('data.csv.bz2', progress = FALSE)
## Parsed with column specification:
## cols(
## .default = col_character(),
## STATE__ = col_double(),
## COUNTY = col_double(),
## BGN_RANGE = col_double(),
## COUNTY_END = col_double(),
## END_RANGE = col_double(),
## LENGTH = col_double(),
## WIDTH = col_double(),
## F = col_integer(),
## MAG = col_double(),
## FATALITIES = col_double(),
## INJURIES = col_double(),
## PROPDMG = col_double(),
## CROPDMG = col_double(),
## LATITUDE = col_double(),
## LONGITUDE = col_double(),
## LATITUDE_E = col_double(),
## LONGITUDE_ = col_double(),
## REFNUM = col_double()
## )
## See spec(...) for full column specifications.
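Reading straight from the compressed file works because R can decompress bzip2 on the fly. A minimal base R sketch of the same idea, using a made-up two-row table and a temporary file instead of the real archive (`toy`, `toy_back` and `path` are illustrative names only):

```r
# Toy table standing in for the storm data (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it to a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile connection, without extracting it first
toy_back <- read.csv(bzfile(path), stringsAsFactors = FALSE)
print(toy_back)
```

readr::read_csv handles the decompression itself when given a file name ending in .bz2, which is why the report can pass the archive name directly.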
According to the codebook,
[Damage] estimates are rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
However, we observe more kinds of values in the PROPDMGEXP and CROPDMGEXP columns:
table(c(raw_data$PROPDMGEXP, raw_data$CROPDMGEXP))
##
## + - 0 1 2 3 4 5 6 7
## 5 1 235 25 14 4 4 28 4 5
## 8 ? B H K M h k m
## 1 15 49 6 706497 13324 1 21 8
We treat a digit value x as a multiplier of 10 ^ x and accept both lower- and uppercase K, M and B as magnitude modifiers. All other codes are dropped; as the table above shows, they are rare.
normalizeValues <- function(values, values_exp) {
  modifier <- values_exp
  # Map the letter codes to powers of ten
  modifier[modifier %in% c('k', 'K')] <- 3
  modifier[modifier %in% c('m', 'M')] <- 6
  modifier[modifier %in% c('b', 'B')] <- 9
  # Digits are kept as exponents; anything else (including missing codes) becomes NA
  modifier <- as.numeric(modifier)
  values * (10 ^ modifier)
}
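To make the rule concrete, here is a small sketch that applies the function to a handful of made-up inputs. The function body is repeated so the snippet runs on its own; `toy_values` and `toy_exp` are illustrative and not taken from the database.

```r
normalizeValues <- function(values, values_exp) {
  modifier <- values_exp
  modifier[modifier %in% c('k', 'K')] <- 3
  modifier[modifier %in% c('m', 'M')] <- 6
  modifier[modifier %in% c('b', 'B')] <- 9
  modifier <- as.numeric(modifier)
  values * (10 ^ modifier)
}

# Made-up damage figures with one exponent code of each kind
toy_values <- c(1.55, 25, 2.5, 3, 10, 7)
toy_exp <- c('B', 'K', 'm', '2', '?', NA)

# The '?' code triggers the "NAs introduced by coercion" warning, silenced here
toy_result <- suppressWarnings(normalizeValues(toy_values, toy_exp))

# The first four entries are scaled; '?' and the missing code give NA
print(toy_result)
```

The first entry reproduces the codebook example, 1.55B for $1,550,000,000.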
Now we keep only the columns that characterize damage, give them lowercase names and normalize the damage costs as described above. We then sum every variable within each event type and compute the overall totals, which are used later to express each event's share as a percentage.
summarised <- raw_data %>%
  # Convert value and exponent pairs to plain dollar amounts
  mutate(
    crop_damage = normalizeValues(CROPDMG, CROPDMGEXP),
    prop_damage = normalizeValues(PROPDMG, PROPDMGEXP),
    total_damage = crop_damage + prop_damage
  ) %>%
  # Keep the damage-related columns under lowercase names
  select(
    event_type = EVTYPE,
    fatalities = FATALITIES,
    injuries = INJURIES,
    crop_damage,
    prop_damage,
    total_damage
  ) %>%
  # Sum every variable within each event type, skipping missing values
  group_by(event_type) %>%
  summarise_all(sum, na.rm = TRUE)
## Warning in normalizeValues(CROPDMG, CROPDMGEXP): NAs introduced by coercion
## Warning in normalizeValues(PROPDMG, PROPDMGEXP): NAs introduced by coercion
total_fatalities <- sum(summarised$fatalities)
total_injuries <- sum(summarised$injuries)
total_damage <- sum(summarised$total_damage)
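How missing values flow through these sums can be seen on a small made-up example. In R, adding NA to a number gives NA, so a record's total is only defined when both components are, and sum(..., na.rm = TRUE) then skips that record. The vectors below are illustrative and not taken from the database; the last step is one possible alternative shown for comparison, not what this report does.

```r
# Made-up damage values for three records; NA marks a dropped or missing exponent
crop <- c(NA, 2e3, 5e3)
prop <- c(1e6, NA, 1e3)

# `+` propagates NA, so only the third record gets a total
total <- crop + prop

sum_crop <- sum(crop, na.rm = TRUE)
sum_prop <- sum(prop, na.rm = TRUE)
sum_total <- sum(total, na.rm = TRUE)

# Alternative (an assumption, not the report's method): treat NA as zero before adding
total_alt <- ifelse(is.na(crop), 0, crop) + ifelse(is.na(prop), 0, prop)
sum_total_alt <- sum(total_alt)

print(c(sum_crop, sum_prop, sum_total, sum_total_alt))
```

With the first approach the summed total can be smaller than the sum of the two component columns; with the second the two always agree.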
We select the five event types most dangerous to human life. Since fatalities and injuries are hard to compare ethically, we sort by fatalities only.
top5_health_dmg <- summarised %>%
  select(event_type, fatalities, injuries) %>%
  arrange(desc(fatalities)) %>%
  head(5)
For economic damage, we can sort by the sum of property and crop damage:
top5_economic_dmg <- summarised %>%
  select(event_type, prop_damage, crop_damage, total_damage) %>%
  arrange(desc(total_damage)) %>%
  head(5)
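The sort-then-truncate pattern used for both rankings can be sketched in base R on a made-up table. The event names and counts below are placeholders, and `order()` stands in for `arrange(desc(...))` so the snippet runs without extra packages.

```r
# Made-up per-event totals
toy <- data.frame(
  event_type = c("A", "B", "C", "D"),
  fatalities = c(10, 250, 40, 3),
  stringsAsFactors = FALSE
)

# Sort by fatalities, largest first, and keep the first two rows
top2 <- head(toy[order(toy$fatalities, decreasing = TRUE), ], 2)
print(top2)
```

Sorting on a single column keeps the ranking unambiguous, which matches the choice above to rank by fatalities alone.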
The five most dangerous event types account for 67% of all fatalities and 76% of all injuries.
Tornados are by far the most dangerous among them, whether judged by fatalities or by injuries.
top5_health_dmg_displ <- top5_health_dmg %>%
  select(event_type, Fatal = fatalities, Injury = injuries) %>%
  # Express each count as a percentage of the overall total
  mutate(
    Fatal = 100 * Fatal / total_fatalities,
    Injury = 100 * Injury / total_injuries
  ) %>%
  # Reshape to long format: one row per event type and outcome
  gather('Outcome', 'percentage', -event_type) %>%
  # Fix the factor order so the most dangerous event is drawn at the top
  mutate(
    event_type = factor(event_type, levels = rev(top5_health_dmg$event_type))
  )
ggplot(top5_health_dmg_displ, aes(x = event_type, y = percentage, fill = Outcome)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  xlab('Event Type') +
  ylab('% of cases') +
  ylim(0, 100) +
  ggtitle('5 most dangerous weather events for human lives') +
  coord_flip()
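The combined shares quoted for the top events follow from the same percentage calculation. A minimal sketch with made-up counts, where the total is taken over all event types and the shares of the leading ones are then added up (all names and numbers here are illustrative):

```r
# Made-up per-event fatality counts
toy <- data.frame(
  event_type = c("A", "B", "C", "D"),
  fatalities = c(60, 25, 10, 5),
  stringsAsFactors = FALSE
)

# The denominator covers every event type, not only the top ones
toy_total <- sum(toy$fatalities)

# Take the two largest event types and express each as a percentage
top2 <- head(toy[order(toy$fatalities, decreasing = TRUE), ], 2)
share_each <- 100 * top2$fatalities / toy_total

# Combined share of the top group
share_combined <- sum(share_each)
print(share_combined)
```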
The five most economically harmful event types account for 78% of all economic damage.
Floods are by far the most harmful of these, with total damage of 138 billion dollars over the whole history of observations.
It is also worth noting that property damage exceeds crop damage for all of these events.
top5_economic_dmg_displ <- top5_economic_dmg %>%
  select(event_type, Property = prop_damage, Crop = crop_damage) %>%
  # Reshape to long format: one row per event type and damaged object
  gather('object', 'damage', -event_type) %>%
  # Fix the factor order so the most harmful event is drawn at the top
  mutate(
    event_type = factor(event_type, levels = rev(top5_economic_dmg$event_type))
  )
ggplot(top5_economic_dmg_displ, aes(x = event_type, y = damage / 1e9, fill = object)) +
  geom_bar(stat = 'identity') +
  xlab('Event Type') +
  ylab('Damage caused, bln USD') +
  ggtitle('5 most harmful weather events for economics') +
  coord_flip()
Many other aspects could be considered; we only mention them here: